Do you think there are too many questions on your survey? Are you worried that participants may get tired of responding to the questions in the middle of the survey? In this two-part series, I demonstrate how to shorten measurement instruments such as surveys automatically in R. The second part focuses on the use of two optimization algorithms (genetic algorithm and ant colony optimization) for reducing the number of questions in surveys and similar instruments.
In the social and behavioral sciences, researchers often use surveys, questionnaires, or scales to collect data from a sample of participants. Such instruments provide an efficient and effective way to collect information about a large group of individuals.
Surveys are used to collect information from or about people to describe, compare, or explain their knowledge, feelings, values, and behaviors. (Fink, 2015)
Sometimes respondents get tired of answering questions during the survey-taking process, especially if the survey is very long. This is known as survey-taking fatigue, and it can significantly affect response quality. When respondents get tired, they may skip questions, provide inaccurate responses due to insufficient effort responding, or even abandon the survey completely. To alleviate this issue, it is important to reduce survey length properly.
In the first part of this series, I demonstrated how to use automated test assembly and recursive feature elimination to automatically shorten educational assessments (e.g., multiple-choice exams, tests, and quizzes). In the second part, I will demonstrate how to use two optimization algorithms, the genetic algorithm (GA) and ant colony optimization (ACO), to create shorter versions of surveys and similar instruments.
In computer science, a genetic algorithm (GA) is a search heuristic that mimics Charles Darwin’s theory of natural evolution. The algorithm reflects the process of natural selection by iteratively selecting and mating strong individuals (i.e., solutions) that are more likely to survive, while preventing weak individuals from producing further weak offspring. This process continues until GA finds an optimal or acceptable solution.
Genetic algorithms are commonly used to generate high-quality solutions to optimization and search problems by relying on biologically inspired operators such as mutation, crossover and selection. (Mitchell, 1998)
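To make the selection, crossover, and mutation steps concrete, here is a minimal base R sketch of a genetic algorithm that picks a subset of items. Everything in it is an assumption for illustration: the simulated `items` matrix, the toy `fitness()` function (squared correlation between the short-form and full-scale sum scores, minus a small penalty per retained item), and the tuning values (population size, mutation rate, number of generations). It is not the implementation used in any particular package.

```r
set.seed(1)

# Simulated responses: 300 respondents, 10 items with decreasing loadings
n <- 300
n_items <- 10
latent <- rnorm(n)
loadings <- seq(0.9, 0.1, length.out = n_items)
items <- sapply(loadings, function(l) l * latent + rnorm(n, sd = sqrt(1 - l^2)))
full_score <- rowSums(items)

# Toy fitness (assumption): variance in the full score explained by the
# short-form sum score, minus a penalty of 0.02 per retained item
fitness <- function(chrom) {
  if (sum(chrom) == 0) return(-Inf)
  short_score <- rowSums(items[, chrom == 1, drop = FALSE])
  cor(short_score, full_score)^2 - 0.02 * sum(chrom)
}

# Each row is one individual: 1 = item retained, 0 = item dropped
pop_size <- 20
pop <- matrix(rbinom(pop_size * n_items, 1, 0.5), nrow = pop_size)
best_history <- numeric(0)

for (gen in 1:40) {
  fit <- apply(pop, 1, fitness)
  best_history <- c(best_history, max(fit))
  elite <- pop[which.max(fit), ]  # elitism: the best individual survives unchanged

  # Tournament selection: the fitter of two random individuals becomes a parent
  pick_parent <- function() {
    cand <- sample(pop_size, 2)
    pop[cand[which.max(fit[cand])], ]
  }

  children <- t(replicate(pop_size - 1, {
    p1 <- pick_parent()
    p2 <- pick_parent()
    cut_point <- sample(n_items - 1, 1)                 # one-point crossover
    child <- c(p1[1:cut_point], p2[(cut_point + 1):n_items])
    flip <- runif(n_items) < 0.05                       # bit-flip mutation
    child[flip] <- 1 - child[flip]
    child
  }))

  pop <- rbind(elite, children)
}

final_fit <- apply(pop, 1, fitness)
best <- pop[which.max(final_fit), ]
which(best == 1)  # indices of the items retained by the best solution
```

Because the best individual is carried over unchanged in every generation, the best fitness value never decreases from one generation to the next.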
In the context of survey abbreviation, GA can be used as an optimization tool for finding a subset of questions that maximally captures the variance (i.e., \(R^2\)) in the original data. Yarkoni (2010) proposed the following cost function for scale abbreviation:
\[Cost = Ik + \sum_{i=1}^{s} w_i(1-R^2_i)\] where \(I\) represents a user-specified fixed item cost, \(k\) represents the number of items to be retained by GA, \(s\) is the number of subscales in the measure, \(w_i\) is the weight associated with the \(i^{th}\) subscale, and \(R^2_i\) is the amount of variance in the \(i^{th}\) subscale explained by its items. If the cost of retaining a particular item is larger than the loss in \(R^2\), then the item is dropped from its subscale (i.e., GA returns a shorter subscale). Yarkoni (2010) demonstrated the use of GA in abbreviating lengthy personality scales, and many researchers have since used GA to abbreviate psychological scales (e.g., Crone et al., 2020; Eisenbarth et al., 2015; Sahdra et al., 2016).
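As a rough illustration of how this cost function trades off length against explained variance, the base R sketch below evaluates it for a candidate set of items. The helper `abbreviation_cost()`, its arguments, and the simulated data are all hypothetical. The sketch also assumes that \(R^2_i\) is obtained by regressing the full subscale sum score on the retained items of that subscale, which is one plausible reading of the definition above rather than a confirmed detail of Yarkoni's procedure.

```r
# Hypothetical helper: cost of a candidate short form
# items:     data frame with all items of the full scale
# keep:      logical vector, TRUE for items retained in the short form
# subscales: list of integer vectors with the column indices of each subscale
# item_cost: fixed cost I per retained item
# weights:   subscale weights w_i
abbreviation_cost <- function(items, keep, subscales, item_cost = 0.01,
                              weights = rep(1, length(subscales))) {
  r2 <- vapply(subscales, function(idx) {
    kept_idx <- idx[keep[idx]]
    if (length(kept_idx) == 0) return(0)  # no retained items, nothing explained
    dat <- data.frame(score = rowSums(items[, idx, drop = FALSE]),
                      items[, kept_idx, drop = FALSE])
    fit <- lm(score ~ ., data = dat)
    # R^2 computed by hand (assumption: ordinary least squares R^2)
    1 - sum(residuals(fit)^2) / sum((dat$score - mean(dat$score))^2)
  }, numeric(1))
  item_cost * sum(keep) + sum(weights * (1 - r2))
}

# Simulated 5-point responses: two subscales with three items each
set.seed(123)
n <- 200
f1 <- rnorm(n)
f2 <- rnorm(n)
to_likert <- function(x) {
  as.integer(cut(x, breaks = c(-Inf, -1.5, -0.5, 0.5, 1.5, Inf)))
}
toy_items <- data.frame(
  a1 = to_likert(f1 + rnorm(n, sd = 0.7)),
  a2 = to_likert(f1 + rnorm(n, sd = 0.7)),
  a3 = to_likert(f1 + rnorm(n, sd = 0.7)),
  b1 = to_likert(f2 + rnorm(n, sd = 0.7)),
  b2 = to_likert(f2 + rnorm(n, sd = 0.7)),
  b3 = to_likert(f2 + rnorm(n, sd = 0.7))
)
toy_subscales <- list(1:3, 4:6)

# Full scale: every R^2 equals 1, so the cost is just I * k
cost_full <- abbreviation_cost(toy_items, rep(TRUE, 6), toy_subscales)

# Short form: drop the third item of each subscale
cost_short <- abbreviation_cost(toy_items,
                                c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
                                toy_subscales)
c(full = cost_full, short = cost_short)
```

With all items retained, the only cost is the item cost \(Ik\); dropping items lowers that term but adds the unexplained variance \(1 - R^2_i\) for each subscale.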
Like GA, ant colony optimization (ACO) is an optimization method. ACO was inspired by the collective behavior of Argentine ants, Iridomyrmex humilis (Goss et al., 1989). While searching for food, these ants drop pheromone on the ground and follow pheromone previously dropped by other ants. Because ants complete the shortest path more quickly, it is traversed more often and retains more pheromone, so subsequent ants are more likely to follow this path and find promising food sources more quickly. Figure 1 illustrates this process.
Engineers decided to use the way Argentine ant colonies function as an analogy to solve the shortest path problem and created the ACO algorithm (Dorigo et al., 1996; Dorigo & Gambardella, 1997). Then, G. A. Marcoulides & Drezner (2003) applied ACO to model specification searches in structural equation modeling (SEM). The goal of this approach is to automate the model fitting process in SEM by starting with a user-specified model and then fitting alternative models to fix missing paths or parameters. This iterative process continues until an optimal model (e.g., a model with good model-fit indices) is identified. Leite et al. (2008) used the ACO algorithm for the development of short forms of scales and found that ACO outperformed traditionally used methods of item selection. In a more recent study, K. M. Marcoulides & Falk (2018) demonstrated how to use ACO for model specification searches in R.
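The pheromone idea can be sketched for item selection in a few lines of base R. In this toy version, each item carries a pheromone level, each "ant" samples a short form with probabilities proportional to those levels, and the items in the best short form found so far receive additional pheromone after every iteration. The simulated data, the `quality()` function (correlation between the short-form and full-scale sum scores), and all tuning values are assumptions for illustration; published applications such as Leite et al. (2008) evaluate candidate short forms with model-based criteria instead.

```r
set.seed(42)

# Simulated responses: 300 respondents, 8 items with unequal loadings
n <- 300
latent <- rnorm(n)
loadings <- c(0.9, 0.8, 0.8, 0.3, 0.3, 0.2, 0.2, 0.1)
items <- sapply(loadings, function(l) l * latent + rnorm(n, sd = sqrt(1 - l^2)))
full_score <- rowSums(items)

# Toy quality criterion (assumption): agreement with the full-scale score
quality <- function(sel) {
  cor(rowSums(items[, sel, drop = FALSE]), full_score)
}

n_items <- ncol(items)
short_len <- 3                   # desired length of the short form
pheromone <- rep(1, n_items)     # every item starts equally attractive
evaporation <- 0.1               # share of pheromone lost per iteration
best_sel <- NULL
best_q <- -Inf

for (iter in 1:30) {
  for (ant in 1:10) {
    # Each ant picks items with probability proportional to pheromone
    sel <- sample(n_items, short_len, prob = pheromone)
    q <- quality(sel)
    if (q > best_q) {
      best_q <- q
      best_sel <- sort(sel)
    }
  }
  # Evaporate, then reinforce the items of the best short form so far
  pheromone <- (1 - evaporation) * pheromone
  pheromone[best_sel] <- pheromone[best_sel] + best_q
}

best_sel                 # items in the best short form found
round(pheromone, 2)      # final pheromone level of each item
```

Evaporation keeps early, possibly poor choices from dominating the search, while the deposit step makes items from good short forms more likely to be sampled again.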
In this example, we will use the Experiences in Close Relationships (ECR) scale (Brennan et al., 1998). The ECR scale consists of 36 items measuring two higher-order attachment dimensions for adults (18 items per dimension): avoidance and anxiety (see Figure 2). The items are based on a 5-point Likert scale (i.e., 1 = strongly disagree to 5 = strongly agree). For each subscale (i.e., dimension), higher scores indicate higher levels of avoidance (or anxiety). Individuals who score high on either or both of these dimensions are assumed to have an insecure adult attachment orientation (Wei et al., 2007).
Wei et al. (2007) developed a 12-item short form of the ECR scale using traditional methods (e.g., dropping items with low item-total correlation and keeping items with the highest factor loadings). In our example, we will use the GA and ACO algorithms to automatically select the best items for the two subscales of the ECR scale. The original data set for the ECR scale is available on the Open-Source Psychometrics Project website. For demonstration purposes, we will use a subset (\(n = 10,798\)) of the original data set based on the following rules:
In the ECR scale, some items are positively phrased and thus indicate lower avoidance (or anxiety) for respondents. Therefore, these items (items 3, 15, 19, 22, 25, 29, 31, 33, 35) have been reverse-coded. Lastly, the respondents with missing responses have been eliminated from the data set. The final data set is available here.
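For readers who want to apply the same preparation steps to their own data, the sketch below shows one way to reverse-code items on a 5-point scale and drop respondents with missing responses in base R. The toy data frame and the column names (`Q1`, `Q3`, `Q15`) are hypothetical and do not reflect the variable names in the actual ECR data file.

```r
# Toy responses on a 5-point scale; the last respondent has a missing value
toy <- data.frame(
  Q1  = c(1, 3, 5, NA),
  Q3  = c(1, 2, 5, 2),
  Q15 = c(4, 4, 2, 1)
)

# Items to reverse-code (hypothetical column names)
reverse_items <- c("Q3", "Q15")

# On a 1-5 scale, reversing maps 1 to 5, 2 to 4, and so on: new = 6 - old
toy_rev <- toy
toy_rev[reverse_items] <- 6 - toy[reverse_items]

# Keep only respondents without missing responses
toy_rev <- toy_rev[complete.cases(toy_rev), ]
toy_rev
```

After this step, higher values on every item point in the same direction, which is what sum scores and the item selection algorithms rely on.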
Now, let’s import the data into R and then preview its content.
ecr <- read.csv("ecr_data.csv", header = TRUE)