My lab recently published a paper in Social Psychological and Personality Science, which I encourage you to read for all the details, but I'd also like to talk a bit about it here. The article's page limits didn't allow me to go into as much depth as I wanted. In this paper, we attempted to validate a standardized version of the Taylor Aggression Paradigm.
What the Hell is the Taylor Aggression Paradigm?
Psychology has long agreed that aggression is an important phenomenon to study, but it has proven very tricky to study well. The majority of the literature consists of self- and peer-reports of aggression in survey research. However, to determine the causal forces behind human aggression, researchers needed to bring these assessments into the laboratory and detach them from the biases inherent in such reports.
In the 1950s and '60s, the scholars Epstein and Taylor developed a paradigm in which participants arrived at the laboratory, were pitted against an opponent in a competitive task, and were allowed to inflict harm on that opponent as part of the task (e.g., shocking them with varying amounts of electricity). This general model became the Taylor Aggression Paradigm (hereafter TAP), which has enjoyed wide popularity and substantial modification.
The TAP has proven to be a target of controversy for lots of reasons, many of which center on the debate regarding violent media and aggression, which I give a wide fucking berth (not my monkey, not my circus). Others, though, have provided evidence for the TAP's internal and external validity.
The TAP is often the butt of jokes, with critics mocking the use of noise blasts (or pins stuck in a voodoo doll, or hot sauce dumped on saltines, or a hand dunked in ice water) as ludicrous operationalizations of harm-doing. Given the ethical constraints of laboratory research, these forms of aggression are about as much as we're allowed to do. I'd love to find a harmless, ethical way to host a scientifically valid Psychology Fight Club (see below), but it's just not feasible. If anyone has a better way of measuring aggression in the lab, I am all ears. Because aggression is so costly, it's incredibly important that we get this right and measure aggression in an accurate, reliable, and replicable way.
Combating Flexibility Issues with Preregistration
Like almost all other psychometric paradigms, the TAP can be implemented, scored, and analyzed in a flexible manner. This can be a blessing, allowing researchers to tailor the task to a given experimental setting or hypothesis, but it can also be a curse when researchers misuse this flexibility to achieve illusory support for hypotheses, capitalizing on chance as they test a given prediction across numerous scoring and analytic regimes. This misuse of the TAP appears to be rampant. I'm not saying that this is the case for any given body of work or scholar; I'll leave that determination to you, my esteemed Reader.
The TAP has emerged as the primary target of this debate about the role of flexibility in undermining sound science, but before you throw your personal pitchfork at the task, ask yourself whether the psychometric instruments you use could pass the same bar. Even a brief survey of the literature will show you that many of the most popular tasks (e.g., Stroop, Go/No-Go, Reading the Mind in the Eyes) are implemented, scored, and analyzed in highly flexible ways. This flexibility often applies to questionnaires too (e.g., subscales vs. total scores, long-form vs. short-form, retaining vs. dropping items that bring alpha below .70).
So what is one to do? Simple: preregister your hypotheses, implementation plan, scoring plan, and analytic strategy, then stick to them as best you can. That's what we did across two studies. I encourage you to check out our preregistration plans for Study 1 and Study 2, and our data/code/materials.
Findings
We used a computerized version of the TAP that measured aggression as the volume and duration of the blasts of a very uncomfortable noise (think of a cat getting sucked into a jet turbine) that participants chose to administer across 25 trials. Throughout the task, their opponent (actually a computer program) provoked them, initially and then repeatedly, by selecting loud and long noise blasts to administer.
In both studies, we averaged scores across all trials of the TAP, on the logic that a greater number of measurements yields a more accurate and reliable estimate of aggression.
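For the curious, here's a minimal sketch of that aggregate scoring in R. The data frame and column names (tap, vol_1 through vol_25, dur_1 through dur_25) are hypothetical stand-ins, not the actual variables from our posted materials.

```r
# Minimal aggregate-scoring sketch (hypothetical column names).
# Assumes a data frame 'tap' with one row per participant and columns
# vol_1:vol_25 (noise volume settings) and dur_1:dur_25 (noise durations).
vol_cols <- paste0("vol_", 1:25)
dur_cols <- paste0("dur_", 1:25)

# Average each participant's settings across all 25 trials
tap$vol_mean <- rowMeans(tap[, vol_cols], na.rm = TRUE)
tap$dur_mean <- rowMeans(tap[, dur_cols], na.rm = TRUE)

# One aggregate score: standardize volume and duration, then average them
tap$tap_agg <- rowMeans(scale(tap[, c("vol_mean", "dur_mean")]))
```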
We found that louder noise blasts on the TAP corresponded to greater aggression on two other canonical aggression measures, the Voodoo Doll Aggression Task and the Hot Sauce Aggression Task (see figure below), as well as on a self-report measure of trait physical aggression. Thus, the task exhibits convergent validity with other aggression measures, which is great! However, showing that one 'contrived' laboratory aggression measure (as several of my reviewers described it) corresponds to two others isn't enough evidence to claim that the task is a solid aggression measure.
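In sketch form, these convergent validity checks are just bivariate correlations (hypothetical variable names again; the paper reports the actual estimates):

```r
# Convergent validity sketch: correlate aggregate TAP scores with the
# other aggression measures (hypothetical variable names).
cor.test(tap$tap_agg, tap$voodoo_pins)     # Voodoo Doll Task: pins inserted
cor.test(tap$tap_agg, tap$hot_sauce_g)     # Hot Sauce Task: amount allocated
cor.test(tap$tap_agg, tap$trait_physical)  # self-reported trait physical aggression
```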
If you wanted to evoke aggression in someone, what would you do? Many of you would probably say 'insult them,' and that is exactly what we did to our participants. We told half of them that an essay they wrote was garbage, which is a great way to evoke aggression from students and ethical enough to use in the lab. As we predicted, doing so increased aggressive behavior on the TAP, providing evidence for the task's construct validity. (Props to the R package vioplot for the figure below.)
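Here's a hedged sketch of that comparison, with a hypothetical insulted indicator (1 = received the insulting essay feedback), plus the kind of violin plot credited above:

```r
library(vioplot)  # violin plots

# Construct validity sketch: did the insult manipulation raise TAP scores?
# 'insulted' is a hypothetical 0/1 condition indicator.
t.test(tap_agg ~ insulted, data = tap)

# Violin plot of aggregate TAP scores by condition
vioplot(tap$tap_agg[tap$insulted == 0], tap$tap_agg[tap$insulted == 1],
        names = c("No insult", "Insulted"))
```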
One of the most debated topics around the TAP is its external/predictive validity. Can the TAP predict who is aggressive in the real world? To find out, we asked people how many physical fights they had been in across varying time spans. We got some seriously mixed results, with TAP scores being associated with greater physical fight frequency over the past year and 'ever', but not over the past 5 years. I'm not sure what to make of these mixed results, so I'm going to leave the external validity of the task as 'currently unknown' in my book. I also don't think that my measure was an ideal assessment of real-world violence (e.g., being in a fight doesn't mean you started it), so I want to approach this issue more rigorously in the future.
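Because fight counts are skewed count data, a rank-based correlation is one defensible way to sketch this check (variable names hypothetical; the paper's models may differ):

```r
# External validity sketch: rank-based associations between TAP scores
# and self-reported physical fight counts over different time spans.
cor.test(tap$tap_agg, tap$fights_past_year, method = "spearman")
cor.test(tap$tap_agg, tap$fights_past_5yr,  method = "spearman")
cor.test(tap$tap_agg, tap$fights_ever,      method = "spearman")
```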
To assess the task's discriminant validity, I wanted to identify variables that were similar to aggression but conceptually distinct, to ensure that the TAP was capturing aggression and not something else. First, I chose *verbal* aggression, as the TAP is a physical aggression measure and therefore shouldn't also capture other forms of aggression. Second, I chose self-harm, as the TAP should capture the tendency to harm others and not the self. Across both studies, there were weak and marginal associations between TAP scores and these two variables. This could mean that the TAP does not exhibit good discriminant validity or, more likely, that I picked two imperfect variables for this purpose; both correlate with physical aggression to a reliable degree. In the future, the discriminant validity of the TAP needs to be investigated with variables that are both conceptually distinct from and uncorrelated with physical aggression.
Tasks with multiple assessments also need to be internally consistent, or else the lack of reliability undermines any inferences gleaned from the task. Principal components analysis showed that the 50 TAP measurements (25 volume and 25 duration settings) largely loaded onto a single component (see scree plot below), which suggests that the aggregate scoring approach we took was appropriate (though see below for reasons why that may not be the case).
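A quick sketch of that check, reusing the hypothetical trial-level columns from the scoring sketch above:

```r
# Internal consistency sketch: PCA over all 50 trial-level measurements
# (25 volume + 25 duration settings; hypothetical column names).
pca <- prcomp(tap[, c(vol_cols, dur_cols)], scale. = TRUE)

# Scree plot: a dominant first component supports aggregate scoring
screeplot(pca, type = "lines", main = "TAP measurements")
summary(pca)  # proportion of variance explained per component
```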
TAP scores did not differ between self-identified males and females, who were, admittedly, not sampled in equal numbers. This was very surprising, given the well-established higher rates of physical violence among males. However, the effect of gender on aggression is not as simple as we once thought, and these findings may reflect this complexity (or not).
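For completeness, a sketch of that comparison (gender is a hypothetical factor variable). Note that R's t.test() defaults to Welch's test, which doesn't assume equal variances; that's convenient given the unequal group sizes:

```r
# Gender comparison sketch ('gender' is a hypothetical two-level factor).
# t.test() applies Welch's correction by default (var.equal = FALSE).
t.test(tap_agg ~ gender, data = tap)
```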
Exploratory Analyses
These datasets also allowed us to examine some questions we had not preregistered, many of which were suggested by our incredible editor and reviewers.
We used structural equation modeling to examine whether the aggregate scoring approach showed good fit to the data. Sadly, a model with a single latent factor onto which each of the TAP's 50 measurements loaded showed pretty crappy fit. We tried to improve things by modeling the first two measurements (Trial 1's volume and duration) as a separate factor, as several aggression researchers have told me that the first trial of the TAP is a 'clean' measure of 'unprovoked' aggression because it precedes the provocation that is often built into the first exchange between participants and their opponents. Doing so didn't help. Neither did modeling the second trial as its own factor (some say that Trial 2 is a 'clean' measure of 'provoked' aggression, as it immediately follows the opponent's pre-programmed provocation). Divvying the measurements up into separate volume and duration latent factors showed the best, though still poor, model fit. Perhaps this suggests that we should model volume and duration settings separately, but the fact that they correlate at r = .93 tells me that the results for the two factors won't meaningfully differ.
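Here's a minimal lavaan sketch of the single-factor model and the volume/duration split, again using the hypothetical trial-level columns from earlier (the actual model syntax is in our posted code):

```r
library(lavaan)  # latent variable modeling

# Single latent aggression factor over all 50 TAP measurements
model_1f <- paste("TAP =~", paste(c(vol_cols, dur_cols), collapse = " + "))
fit_1f <- cfa(model_1f, data = tap)

# Separate (correlated by default) volume and duration factors
model_2f <- paste(
  paste("VOL =~", paste(vol_cols, collapse = " + ")),
  paste("DUR =~", paste(dur_cols, collapse = " + ")),
  sep = "\n"
)
fit_2f <- cfa(model_2f, data = tap)

fitMeasures(fit_1f, c("cfi", "rmsea", "srmr"))
fitMeasures(fit_2f, c("cfi", "rmsea", "srmr"))
```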
We also used SEM to examine the taxonomy of aggression measures. First, we examined whether our self-reported and behaviorally assessed measures of aggression loaded onto a single aggression factor (left panel of the figure below). This model didn't fit the data well, so we tried again, modeling self-reported and behavioral aggression as separate, though correlated, factors. This model fit the data much better (right panel of the figure below).
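A sketch of that comparison, with hypothetical indicator names standing in for our measures (the one-factor model is nested in the two-factor model, so their fits can be compared directly):

```r
# Taxonomy sketch: one common aggression factor vs. correlated
# behavioral and self-report factors (hypothetical indicator names).
model_common <- '
  AGG =~ tap_agg + voodoo_pins + hot_sauce_g + trait_physical + trait_verbal
'
model_split <- '
  BEHAV =~ tap_agg + voodoo_pins + hot_sauce_g
  SELF  =~ trait_physical + trait_verbal
'
fit_common <- cfa(model_common, data = tap)
fit_split  <- cfa(model_split,  data = tap)
anova(fit_common, fit_split)  # chi-square difference test of nested models
```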
Because the self-reported aggression measure was also a trait measure, it's impossible to know whether these findings mean that self-reports and behavioral assessments of aggression are meaningfully different, or whether it simply comes down to state vs. trait. However, the model does show that TAP scores had the highest loading on the behavioral aggression factor, followed closely by pin counts from the Voodoo Doll Aggression Task, with the Hot Sauce Aggression Task far behind. Though exploratory, these findings might suggest that the TAP is a superior aggression measure to the other two.
We also examined whether other variables had curvilinear associations with TAP scores (see below), and we found two. As TAP scores increased, their association with voodoo doll pin counts became negative while their association with hot sauce allocations became more positive. These conflicting patterns are hard to interpret, but they might simply reflect that TAP scores become less accurate at their extremes. Indeed, participants who set the volume and duration to the maximum value on every trial may not be taking the task seriously, or may be trying to 'troll' the researchers. We should perhaps be cautious of findings obtained using only the 'extreme aggression' scores from the TAP. Then again, curvilinear effects are notoriously difficult to detect in all but very large samples, so perhaps these finicky results come down to that simple statistical issue.
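A simple way to sketch these tests is with an orthogonal quadratic term; a significant second coefficient indicates curvature:

```r
# Curvilinear sketch: regress each measure on linear + quadratic TAP terms.
# poly() builds orthogonal polynomial terms (hypothetical variable names).
summary(lm(voodoo_pins ~ poly(tap_agg, 2), data = tap))
summary(lm(hot_sauce_g ~ poly(tap_agg, 2), data = tap))
```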
Conclusions and Future Directions
First things first: this was a flawed and preliminary first go. The manuscript details ways in which things went wrong during data collection and ways in which we deviated from the preregistration plan. Ideally, I'd do the whole thing again without any deviations or errors. Those flaws aside, I think our findings offer cautious optimism for the use of the TAP (assuming that its implementation, scoring, and analysis are preregistered). The TAP seems to agree with other measures of the same construct and shows reliable reactivity to provocation. The evidence for the TAP's external and discriminant validity is mixed, but this is likely due to poor psychometric choices on the part of yours truly, as detailed above. Overall, the evidence supports the use of an aggregate scoring approach to the TAP. It is completely uncertain how other scoring approaches might fare, though you should check out this preprint, which may offer some insight.
Like almost any psychological measure, the TAP, with its inherent flexibility, can be misused. That doesn't mean the task itself is flawed. You don't blame the car when an absent-minded driver flips it over a highway median, and in the same way you shouldn't blame the TAP for operator error. In this project, we tried to take such error off the table by preregistering our practices (with mixed success). In doing so, I think we have provided some imperfect, preliminary evidence that this approach to the TAP is sound, which places the mantle of responsibility at the feet of the investigators, where it should be. Now it's our job to respect this tool and do things the right way.