Case Study: Automated AI-assisted Sociological Segmentation

8/8/2025 • Case Study

Atanas Tonchev

In sociological studies, segmentation is used to divide a population into distinct groups based on shared characteristics (e.g., demographics, behaviors, values) so researchers can analyze differences, patterns, and relationships within and between those groups. Our partners at Bright Marketing Research are among the leaders in market and social research. Many of their studies require identifying key groups of respondents in order to follow trends and single out groups of interest. Together, we developed a tool for automated survey segmentation that allows for extensive customization. In this article, we describe the combination of statistical methods and AI we used to build it.

Segmentation is often used in longitudinal studies, where a model is trained on the first wave of respondents and then used to predict the segments for all future waves. By meaningfully simplifying the data, this makes subsequent analysis far easier and allows for much more informative visualizations.
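To make that workflow concrete, here is a minimal sketch of the train-once, predict-later pattern using scikit-learn. The file names and column names are placeholders rather than our tool's actual interface, and the classifier used here (LDA) is introduced in the next section.

```python
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical data: the first wave carries assigned segments,
# later waves only carry the raw responses.
wave1 = pd.read_csv("wave1.csv")
wave2 = pd.read_csv("wave2.csv")
predictors = ["q1", "q5", "q12"]  # placeholder predictor columns

# Train once on the first wave...
model = LinearDiscriminantAnalysis()
model.fit(wave1[predictors], wave1["segment"])

# ...then reuse the same model to assign segments in every subsequent wave.
wave2["segment"] = model.predict(wave2[predictors])
```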

Segmentation and LDA

The most common method for segmentation is Linear Discriminant Analysis (LDA). Once trained, an LDA model provides a way to identify whether a respondent belongs to a certain group based on their responses to a set of predictor variables. Just as importantly, LDA is a relatively simple model, allowing a researcher to easily see why a respondent has been placed in a certain group and with what certainty.
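Continuing the sketch above, the fitted model can be inspected directly; the snippet below is a hedged illustration of that kind of transparency, not our tool's exact output.

```python
import pandas as pd

# How each predictor pulls respondents towards each segment: one row of
# coefficients per segment (a single row when there are only two segments).
print(pd.DataFrame(model.coef_, columns=predictors))

# How certain each assignment is: per-segment membership probabilities.
probabilities = model.predict_proba(wave2[predictors])
print(pd.DataFrame(probabilities, columns=model.classes_).head())
```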

However, LDA can be difficult to train. First, it uses a limited set of predictor variables (often between 5 and 10), which means they need to be specified explicitly. This is difficult when working with datasets that contain hundreds of variables. Another problem is that LDA is a supervised model, meaning it needs a target variable during training. In practice, a researcher must first identify predictors, decide on the number of segments and their characteristics, and finally assign a segment to each respondent. This process is not only slow and tiresome but also relies heavily on hunches rather than data.

Predictor Variable Identification through PCA

Identifying good predictor variables is key to training a reliable LDA model. While our tool allows such variables to be specified manually, it also provides an unsupervised way to identify the most important variables in the dataset. To find the variables that account for the most variation within a dataset, the tool uses Principal Component Analysis (PCA). Normally, PCA is used for dimensionality reduction: creating synthetic variables that best capture a dataset's variance, which is often useful for visualizing complex data.

However, these abstract synthetic variables do not provide the clarity that LDA seeks. LDA requires the human-readable variables that come with a dataset. To this end, we calculate the total contribution of each original variable to the new, synthetic ones. Summing these contributions gives the total contribution of each variable to the dataset's overall variance, and we pick the top 10. This is usually enough to provide a powerful yet understandable set of predictor variables.
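As a rough sketch of the idea (the weighting below is one common choice, not necessarily the exact formula in our tool), each original variable's absolute loadings can be weighted by the components' explained variance and summed. The column names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical survey data: numeric responses only, one column per question.
X = pd.read_csv("wave1.csv").drop(columns=["respondent_id"])
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)

# Contribution of each original variable: absolute loading on every component,
# weighted by that component's explained variance ratio, summed over components.
loadings = np.abs(pca.components_)                      # (n_components, n_features)
weights = pca.explained_variance_ratio_[:, np.newaxis]  # (n_components, 1)
contribution = (loadings * weights).sum(axis=0)

# Keep the 10 most informative original variables as LDA predictors.
top_predictors = pd.Series(contribution, index=X.columns).nlargest(10)
print(top_predictors)
```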

Target Segment Creation through Clustering

Perhaps the most tiresome part of training an LDA model is generating the target variable. Usually, a researcher goes through each respondent in the training wave and decides which segment they belong to. This is time-consuming and not very precise.

Our tool uses clustering to generate groups in an unsupervised manner. It fits models with various numbers of clusters and lets the user select an optimal number based on how well the different clusterings perform. The result is an almost entirely automated, statistically grounded separation of respondents into clear groups.
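The clustering algorithm and quality metric are left to the tool; as one possible sketch, k-means can be fitted for a range of candidate segment counts and compared with a silhouette score, with the researcher making the final call. This reuses X_scaled from the PCA sketch above.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Fit clusterings of several candidate sizes and score how well separated they are.
scores = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    scores[k] = silhouette_score(X_scaled, km.labels_)

for k, score in scores.items():
    print(f"{k} clusters: silhouette = {score:.3f}")

# Whichever size the researcher settles on then provides the LDA target variable.
chosen_k = max(scores, key=scores.get)  # or picked manually from the scores above
segments = KMeans(n_clusters=chosen_k, n_init=10, random_state=0).fit_predict(X_scaled)
```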

AI-generated Segment Descriptions

When training LDA models with automatically selected predictors and automatically generated target segments, we consistently achieved accuracies above 90%. Only one aspect of a researcher's segmentation work remained to be automated: usually, the manually identified segments are named and described by the researcher after going through the responses (e.g., Enthusiastic Active Users, Occasional Users, etc.).

To help with this, our tool generates automated descriptions of each cluster based on its most common responses to the predictor variables. For example: "Cluster 1 consists of respondents who answered 'daily' to the question 'How often do you use our product?', 'neither agree nor disagree' to the question 'Would you say that you enjoyed using our product?', etc." Of course, these descriptions are often too wordy and don't give a good overview of each cluster's main characteristics. We therefore send them to an LLM, asking for a brief summary of the cluster's behaviour and a label that captures it.
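A rough sketch of both steps is shown below. The question texts and helper names are illustrative, and the LLM call assumes an OpenAI-style chat client purely for the example; any comparable API would work.

```python
from openai import OpenAI

def describe_cluster(cluster_responses, cluster_id, questions):
    """Verbose description built from the most common answer to each predictor question."""
    parts = [
        f"answered '{cluster_responses[column].mode().iloc[0]}' to the question '{text}'"
        for column, text in questions.items()
    ]
    return f"Cluster {cluster_id} consists of respondents who " + ", ".join(parts) + "."

def summarize_cluster(description):
    """Ask an LLM for a one-sentence behavioural summary and a short label."""
    client = OpenAI()  # reads the API key from the environment
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{
            "role": "user",
            "content": "Summarize this survey segment's behaviour in one sentence "
                       "and give it a short label:\n" + description,
        }],
    )
    return reply.choices[0].message.content

# Example usage with the hypothetical data from the earlier sketches:
# text = describe_cluster(wave1[segments == 0], 0,
#                         {"q1": "How often do you use our product?"})
# print(summarize_cluster(text))
```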

