Blog

Contingent Validity: Experimental Manipulations Deserve A More Prominent Role in Validating Psychological Trait Scales

4/7/2024

Thanks to the marvelous efforts of folks like Jessica Flake and Eiko Fried, I don't need to convince anyone reading this that we have to take the validation process for psychological self-report scales seriously*. There are many ways to validate such measures, which have been expertly detailed elsewhere. In this post, I will focus on an oft neglected form of scale validation, which I term contingent validity --- which reflects whether a scale's criterion validity is contingent upon theoretically-appropriate situational conditions.

I've been on sabbatical and part of my scholarly leave activities has been to read older texts on the topics I study. One book from the early 1980s on the measurement of aggression repeatedly cited examples of scale validation attempts that centered on the use of experimental manipulations to either (A) impact the scale's score directly or (B) moderate the link between the scale's score and a criterion measure. The former should be applied to state scales that assess momentary and transient constructs, whereas the latter should be applied to trait scales that measure durable constructs; the latter will remain our focus in this introduction to contingent validity**.

Some examples I read were:
Example 1 --- To validate a Hostility Scale, a group of investigators tested whether people who scored higher on this hostility measure would say more and more hostile words when others reinforced (versus punished) them for using such words. (IV: Hostility Scale score, DV: hostile word count, Moderator: reinforcement vs. punishment).

Example 2 --- For a Different Hostility Scale, investigators tested whether people who scored higher on this hostility measure would identify with more hostile traits after they were exposed to an arousing (versus non-arousing) stimulus. (IV: Different Hostility Scale score, DV: hostile trait identification, Moderator: arousal vs. non-arousal).

In my reading of texts from around this same time, it became clear that using experimental manipulations to examine the construct validity of scales was commonplace. Such practices are now mostly absent from the literature as far as I can tell.

The use of these manipulations was clearly motivated by the investigators' desire to test whether the scale they developed would perform as their theories suggested it should (i.e., predict more hostile behavior when reinforced or sympathetically aroused). Though these validation studies preceded the person-situation debate, they were driven by the logic that this controversy eventually bestowed (i.e., that personality -- and valid measures thereof -- are best understood in the context of the environment).

*My own efforts to convince folks about the dire need to better validate our experimental manipulations have been met with less enthusiasm and success.

**The authors of those scales from the mid-20th century didn't seem to make this distinction, often failing to articulate whether a given measure was of a trait or a state, and thus they often examined the direct and moderating impacts of experimental manipulations interchangeably, something that they shouldn't have done.

Cognitive-Affective Processing System (CAPS): Personality As Situational Contingencies

Personality is a bit of a conundrum in that it is stable across situations but also dramatically affects how you respond to a given situation. My favorite graduate seminar was on personality psychology and focused heavily on this issue. One of the best theories we discussed was Mischel & Shoda's Cognitive-Affective Processing System (CAPS) Theory. This theory arose out of the ashes of the person-situation debate and has numerous elements that explain how personality and situations interact to explain human behavior. My favorite of these proposed theoretical elements is circled in red below: the if-then profiles that arise as a product of the CAPS. These if-then profiles reflect 'distinctive and stable' patterns of behavioral responses to specific situations. For example, some people tend to respond to situations where they feel insulted with anger and aggression, whereas others tend to respond with fear and avoidance. Some people tend to respond to others' need for help with confidence and action, whereas others tend to respond with uncertainty and inaction. It is these if-then profiles that can be leveraged to examine a scale's contingent validity.

Adapted from Mendoza-Denton & Mischel (2010)

Contingent Validity
To understand contingent validity, we must understand one of its key ingredients -- criterion validity. Criterion validity, which can be assessed in a predictive or concurrent manner, refers to whether a given scale's score is associated with a theoretically-appropriate outcome (usually a behavioral outcome). For instance, a reactive aggressiveness scale with sufficient criterion validity will produce scores that are positively associated with the number of, say, violent crimes someone has committed or will commit.

Given that such behavioral manifestations of personality should often be contingent on the situation (e.g., the if-then profiles articulated in the CAPS theory), then a valid measure should reflect such situational contingency (i.e., should exhibit evidence of contingent validity). This should be especially true for traits that are especially situationally contingent in their theoretical definitions. Reactive aggression, for example, is characterized by a tendency to respond to provocations with impulsive levels of aggression. As such, the link between scores on a valid reactive aggressiveness scale and violent behavior should be amplified by situations characterized by interpersonal provocation. This can be tested in a simple moderation model depicted below.
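To make the moderation logic concrete, here is a minimal sketch in Python. It rests on several assumptions of mine: the moderator is a two-level provocation manipulation (coded 0 = neutral, 1 = provoked), the data are simulated, and every number in it is invented. With a two-level moderator, the difference between the two within-condition slopes equals the interaction coefficient from a regression of the criterion on the scale score, the condition, and their product. A real validation study would also need a standard error or significance test for that interaction, which this sketch leaves out.

```python
import random


def ols_slope(x, y):
    """Least-squares slope for predicting y from x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simple_slopes(scale, criterion, provoked):
    """Return (slope when neutral, slope when provoked, their difference).

    The difference is the scale-by-condition interaction: evidence of
    contingent validity if it is reliably in the predicted direction.
    """
    by_condition = {0: ([], []), 1: ([], [])}
    for s, c, p in zip(scale, criterion, provoked):
        by_condition[p][0].append(s)
        by_condition[p][1].append(c)
    slope_neutral = ols_slope(*by_condition[0])
    slope_provoked = ols_slope(*by_condition[1])
    return slope_neutral, slope_provoked, slope_provoked - slope_neutral


# Simulated data: every number below is invented for illustration only.
rng = random.Random(2024)
scale, criterion, provoked = [], [], []
for i in range(400):
    condition = i % 2                        # 0 = neutral, 1 = provoked
    score = rng.gauss(0, 1)                  # standardized trait scale score
    true_slope = 0.6 if condition else 0.1   # assumed population slopes
    scale.append(score)
    provoked.append(condition)
    criterion.append(true_slope * score + rng.gauss(0, 1))

neutral, under_provocation, interaction = simple_slopes(scale, criterion, provoked)
print(f"scale-criterion slope, neutral:  {neutral:.2f}")
print(f"scale-criterion slope, provoked: {under_provocation:.2f}")
print(f"difference (interaction):        {interaction:.2f}")
```

If the scale has contingent validity in the sense described above, the slope under provocation should be reliably steeper than the slope in the neutral condition.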

Future Directions
I hope for a return of the field to the practices of contingent validity. I'd like nothing more than to review and read scale validation papers that place situational variables in a key role. I think a lot of this work is going on, but under the auspices of substantive hypothesis testing instead of validation efforts. I hope such studies will grow in quantity and quality and be given their proper home in the realm of psychometrics.

I think experimental manipulations are often ignored or actively avoided by personality and other assessment-oriented psychologists. Perhaps because they are perceived as irrelevant to scale validation (a perception I hope to have combated here), or perhaps because of their association with Questionable Manipulation Practices (QMAPs), or maybe other reasons. But at the end of the day, measurement and manipulation skills should be in the toolkit of every psychological researcher, especially those

Just as personality and clinical folks have admonished experimental psychologists about their lacking psychometric skills, I hope we experimental folks will respectfully advocate for assessment-oriented folks to adopt experimental manipulation approaches as well.

That said, there is no need to rely purely on experimental manipulations to examine contingent validity. One could easily examine situational contingency using correlational approaches. One could easily test whether a scale's criterion validity is altered among people who tend to experience more or less of the situation (cross-sectional approach) or whether such criterion validity fluctuates within-participants as a function of the given situation's presence or absence over time (repeated-measures approach).

Yet given the many advantages of well-validated manipulations, the experimental approach deserves a prominent role in the estimation of contingent validity. I hope to see such a trend soon.


A Pharmaceutical Approach to Manipulation Validation

7/6/2020


As part of my ongoing series on the validation of experimental manipulations, I want to propose that we adopt some techniques from pharmacology to improve the practices of experimental psychology.

Pharmaceutical researchers conduct drug trials, in which they administer drugs that contain specific active ingredients (i.e., the substances in a given drug that exert the intended therapeutic effect) alongside secondary ingredients that serve non-therapeutic purposes. There are many secondary ingredients in each drug, but they exist to serve ancillary functions like binding a pill together or allowing for stable metabolism of the active ingredient.

Just as pharmacists explicitly label the active ingredients in each drug, experimenters should identify the active ingredient(s) in their psychological manipulations and explicitly differentiate them from the other aspects of the manipulation that serve secondary roles such as those that serve to bolster the cover story or distract from the deceptive elements of the procedures.

For example, the Cyberball paradigm has participants believe they are interacting with at least 2 other people over the internet, with whom they will play a simple ball-tossing task (the other people don't actually exist and are simulated by a computer program). The task is then framed as a mental visualization exercise to get participants to imagine the ball toss 'as if it were happening in real life'. This cover story is a distraction from the key aspect of the manipulation, which is that participants will either receive the same number of ball tosses as their 2 compatriots, or will be excluded from the toss by their compatriots, who will just toss the ball back-and-forth to each other. In this case, the cover story about mental visualization and the deceptive elements about interacting with other people are secondary ingredients in the manipulation. The active ingredient is the set of ball-tosses that participants are excluded from.

Identifying these active ingredients is crucial for a validation process I will explain in a moment.

In medicine, it is also crucial to report the dosage of the active ingredient, typically in milligrams. This dosage information is critical to have in order to understand the drug's potency and therapeutic efficacy. Experimental psychologists should attempt to do the same. Once the active ingredient is identified, the units should be articulated (when possible). For Cyberball, each unit of the active ingredient is 1 excluded ball toss. For a fear-learning paradigm, it may be 1 scary image.

These units may often be arbitrary, such as in the case of a humor induction in which participants watch a funny video. The units here could be arbitrary lengths of the video (e.g., 10 second video segments).

For some manipulations, this unit-ification process may be impossible. For example, in one provocation manipulation, participants are given either harsh or pleasant feedback from someone else on an essay they wrote. They receive either a low or high score from their essay rater, which would allow for the creation of some units (i.e., 1 point removed from the total score), but they also receive written feedback (i.e., "WORST essay I've ever read!" or "GREAT essay!"). How could such text be placed into units? In some cases it will simply be impossible, but experimenters should try to develop manipulations that can be articulated in clear units whenever possible, so that they can construct a dose-response curve.

Identifying the active ingredient in your manipulation and its units allows you to take the next step, creating a dose-response curve. Pharmaceutical researchers administer new drugs at varying doses so that a dose-response curve can be modeled. This allows them to identify (A) at what minimum dosage is there a therapeutic effect, (B) at what dosage is that effect optimized, and (C) at what dosage does the effect level off (i.e., does greater dosage produce no meaningful increase in therapeutic effect).

This is crucial information for experimental psychologists to have as well! For each manipulation, we should know how many units of the active manipulation ingredient are needed to elicit a sufficient manipulation effect on our target psychological process. We should also know how many units of that active manipulation ingredient create the largest (i.e., optimal) effect and at what levels of the manipulation do we stop seeing a corresponding increase in the effect.

To do this, you simply need to administer the experimental condition of your manipulation at varying doses (i.e., units) of the manipulation's active ingredient, holding all secondary ingredients constant. You can then plot the standardized effect of each 'dose' of your manipulation against participants' baseline to create a dose-response curve for your manipulation. This process requires that you administer valid measures of your target construct before and after each level of the manipulation is induced. This is akin to a process seen often in neuroimaging and other disciplines called parametric modulation.

How many different levels of the manipulation should be included? This is up to each experimenter, but more is better. Including more levels will give you a more granular and fine-grained dose-response curve.
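As a rough illustration of the procedure just described, the sketch below computes one standardized effect per dose from pre- and post-manipulation scores. The choice of standardizer (mean change divided by the baseline standard deviation) is my assumption, as it is only one of several conventions for standardized change. The Cyberball-style doses and the ratings are hypothetical numbers made up for the example.

```python
from statistics import mean, stdev


def standardized_change(pre, post):
    """Mean pre-to-post change divided by the SD of the baseline scores.

    Assumption: baseline SD is the standardizer; other conventions exist.
    """
    changes = [after - before for before, after in zip(pre, post)]
    return mean(changes) / stdev(pre)


def dose_response_curve(data):
    """Map {dose: (pre_scores, post_scores)} to [(dose, effect), ...] by dose."""
    return [(dose, standardized_change(*data[dose])) for dose in sorted(data)]


# Hypothetical ratings of felt rejection (1-7) before and after Cyberball,
# where the dose is the number of excluded ball tosses. Invented numbers.
ratings_by_dose = {
    5: ([2, 3, 2, 4, 3], [2, 3, 3, 4, 3]),
    10: ([3, 2, 3, 2, 4], [4, 3, 4, 3, 4]),
    20: ([2, 3, 4, 3, 2], [4, 5, 5, 4, 4]),
    30: ([3, 3, 2, 4, 2], [5, 5, 4, 6, 4]),
}

for dose, effect in dose_response_curve(ratings_by_dose):
    print(f"{dose:>2} excluded tosses: standardized change = {effect:.2f}")
```

Plotting the resulting pairs (dose on the x-axis, standardized effect on the y-axis) gives the dose-response curve for the manipulation.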

No control conditions need to be included in this process, just the experimental one. By manipulating just the parts of the manipulation that you expect to have an effect, holding secondary aspects constant, you can be sure that the part of your manipulation that you intend to have the effect is exerting the effect and not some other aspect.

Identifying the dose-response curve of your manipulation will allow you to select the optimal number of units of your active manipulation ingredient to administer to participants. Doing so will ensure that your manipulation is strong enough to have the desired effect while simultaneously reducing participant and experimenter burden by avoiding excessively large 'doses' that may consume valuable time and resources. Further, if you are manipulating an aversive or sensitive psychological state (e.g., pain), such dose-response curves will allow you to select an experimental dosage that does not induce so much of that state that it becomes unethical and harmful to your participants.

Weighing these concerns will make each study's optimal level of manipulation dosage unique. Still, a quick rule of thumb used by pharmaceutical researchers is to first identify the drug's maximum effect and then find the dosage that corresponds to 50% of that maximum effect (indicated by the X line in the figure below). Experimenters could select dosages of their manipulations that approximate the 50% mark, though again, this may be different for each study.
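The 50% rule of thumb can be sketched in a few lines. This version assumes you already have a dose-response curve as (dose, effect) pairs sorted by dose, and it locates the half-maximum dose by straight-line interpolation between the two neighboring doses. That interpolation is a simplification I am choosing for illustration; pharmacologists more commonly fit a sigmoidal model to the full curve. The curve values are hypothetical.

```python
def half_max_dose(curve):
    """Estimate the dose that yields 50% of the largest observed effect.

    curve: list of (dose, effect) pairs sorted by ascending dose.
    Uses linear interpolation between adjacent doses (a simplification).
    Returns None if the half-maximum is never reached.
    """
    target = 0.5 * max(effect for _, effect in curve)
    previous_dose, previous_effect = curve[0]
    if previous_effect >= target:
        return previous_dose
    for dose, effect in curve[1:]:
        if effect >= target:
            fraction = (target - previous_effect) / (effect - previous_effect)
            return previous_dose + fraction * (dose - previous_dose)
        previous_dose, previous_effect = dose, effect
    return None


# Hypothetical curve: dose = excluded ball tosses, effect = standardized change.
example_curve = [(0, 0.0), (5, 0.2), (10, 0.6), (20, 0.9), (30, 1.0)]
print(half_max_dose(example_curve))
```

In this made-up curve the maximum effect is 1.0, so the function looks for the dose at which the effect reaches 0.5, which falls between 5 and 10 excluded tosses.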

Another pharmaceutical practice is to measure and plot dose-response curves for drugs' various unwanted side-effects. Just as drugs may induce undesired physical symptoms, experimental manipulations often induce psychological states that are ancillary to the manipulation's intended effect (at best) and may undermine the efficacy of the manipulation (at worst). Thus, it is crucial to identify psychological variables that you do not desire your manipulation to induce and to test these alongside the target variable(s).

If the side effects become stronger than your desired effect at a certain dosage, then experimenters will want to stay below that dosage to avoid interference between their manipulation's intended and unintended effects.
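One simple way to operationalize this is to compare the two curves dose by dose and keep the highest dose reached before the side effect overtakes the intended effect. The sketch below assumes both effects are expressed on the same standardized scale as non-negative magnitudes; the function name and the numbers are hypothetical.

```python
def highest_dose_before_crossover(doses, intended, side):
    """Return the largest dose reached before the side effect exceeds the
    intended effect, scanning doses in ascending order.

    Returns None if the side effect already dominates at the lowest dose.
    Assumes all three lists are aligned and sorted by ascending dose.
    """
    safe_dose = None
    for dose, wanted, unwanted in zip(doses, intended, side):
        if unwanted > wanted:
            break
        safe_dose = dose
    return safe_dose


# Hypothetical standardized effects at each dose (invented numbers).
doses = [5, 10, 20, 30]
intended_effect = [0.2, 0.6, 0.9, 1.0]   # e.g., felt rejection
side_effect = [0.0, 0.1, 0.5, 1.2]       # e.g., boredom or frustration

print(highest_dose_before_crossover(doses, intended_effect, side_effect))
```

With these invented values the side effect overtakes the intended effect at the largest dose, so the experimenter would stay at or below the dose just before it.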

Experimental psychologists aren't pharmacists or medical doctors. We shouldn't try to mirror their practices just for the sake of superficial credibility. Yet the dose-response curve approach to titrating the levels of our experimental manipulations' active ingredients is one case where we may stand to gain a lot. The psychological states we manipulate are impactful (or else we wouldn't bother manipulating them) and we want to make sure our manipulations are optimized to protect our participants, yield robust effects, avoid the interference of unwanted side effects, and avoid over-burdening ourselves or our participants. A dose-response approach allows us to do just that and I hope to see such titration experiments in the literature in future days.


Validating Experimental Manipulations w/ Passive Control Conditions

6/24/2020


The final version of the Measurement Schmeasurement paper was released by JK Flake and Eiko Fried, which expertly highlights psychology's measurement crisis surrounding Questionable Measurement Practices (QMPs). I get the vibe that folks have largely been persuaded that this crisis is real, QMPs exist and should be eliminated, and the psychometric practices of our field need an overhaul. At the same time, I've been trying to ring the bell about similar shortcomings in our field's Questionable MAnipulation Practices (QMAPs) with much less success. I'm not sure if that's just due to my own failings as a science advocate or if folks don't think there's a big problem with how we're approaching the manipulation of psychological variables. For anyone who remains unconvinced, there's a new paper in Perspectives on Psychological Science that provides more evidence for just how bad things might be.

This paper critiqued the power posing literature. Power posing studies manipulate powerful postures by asking participants to either adopt an expansive body stance (see below) or a control stance. In the control condition, participants adopt a contractive stance (e.g., hugging yourself, head down, legs pulled together).

Image acquired from Wikimedia Commons

These manipulations are then used to show that such 'power poses' make people feel and behave more assertively, inter alia. But the use of an active control condition (i.e., a control condition where the participant completes a task that is intended to induce a dissimilar psychological state from the experimental condition) prevents us from knowing what participants would have done and felt in the absence of instructions to strike a pose. As such, we don't know which of the two conditions is driving the manipulation's effect. The relatively greater assertiveness observed among those who strike an expansive (versus contractive) pose is often interpreted as being attributable to the expansive posture condition (i.e., the power pose), but it's just as likely that the contractive posture (i.e., the active control condition) leads people to be less assertive!

There are 3 possible configurations for how a promising difference between an experimental condition and active control condition might arise. In Scenario A (see below), both the experimental and active control conditions exert their intended effects. Specifically, participants in the experimental condition (e.g., expansive posture) show a 0.5 increase from baseline (i.e., the passive control condition) and participants in the active control condition (e.g., contractive posture) show a 0.5 decrease from baseline, for a mean difference of 1.0 between the experimental and active control conditions.

In Scenario B (see below), only the experimental condition exerts its intended effect. Specifically, participants in the experimental condition (e.g., expansive posture) show a 1.0 increase from baseline (i.e., the passive control condition) and participants in the active control condition (e.g., contractive posture) show no change from baseline, for a mean difference of 1.0 between the experimental and active control conditions.

In Scenario C (see below), only the active control condition exerts its intended effect. Specifically, participants in the experimental condition (e.g., expansive posture) show no change from baseline (i.e., the passive control condition) and participants in the active control condition (e.g., contractive posture) show a 1.0 decrease from baseline, for a mean difference of 1.0 between the experimental and active control conditions.

All 3 scenarios produce the same observed difference between experimental and control conditions, but from meaningfully different patterns of effects. Scenarios A and B are what most investigators want, as the experimental condition is having the intended effect. Yet without a 'neutral', passive control condition that captures participants' baseline, experimenters cannot know if their effect is actually a reflection of Scenario C.
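The arithmetic behind the three scenarios is easy to check. The short sketch below encodes each scenario as changes from the passive (baseline) condition, using the values given above, and shows that the experimental-versus-active-control difference is identical in all three even though different conditions are doing the work. The dictionary layout and function names are my own.

```python
# Change from the passive (baseline) control condition in each scenario,
# taken from the three scenarios described in the text.
scenarios = {
    "A": {"experimental": 0.5, "active_control": -0.5},
    "B": {"experimental": 1.0, "active_control": 0.0},
    "C": {"experimental": 0.0, "active_control": -1.0},
}


def observed_difference(scenario):
    """Difference a two-condition study (no passive control) would observe."""
    return scenario["experimental"] - scenario["active_control"]


def conditions_driving_effect(scenario):
    """Conditions that actually depart from baseline; needs a passive control."""
    return [name for name, change in scenario.items() if change != 0]


for label, scenario in scenarios.items():
    print(label, observed_difference(scenario), conditions_driving_effect(scenario))
```

Only the second quantity distinguishes the scenarios, and it can only be computed when a passive control condition supplies the baseline.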

The new meta-analysis sought to examine which of these 3 scenarios was most likely present in the power posing literature. The authors did so by examining power posing studies that also included 'neutral' conditions, in which participants were not instructed to strike either an expansive or contractive posture. The meta-analysis compared expansive and contractive posture conditions against this neutral condition to test which of these two conditions was doing the heavy-lifting. Counter to claims from the power posing literature, the meta-analysis found that comparing the expansive posture condition to the neutral condition returned a null overall effect, g = .06, p = .197. The meta-analysis went on to find that comparing the contractive control condition to the neutral condition returned a large overall effect, g = .45, p < .001. These findings suggest that the difference between the experimental and active control conditions of power pose manipulations was due to effects from the active control condition and not the experimental condition (i.e., Scenario C)!

The main take home from this paper is best stated by the authors: "...the results point to the importance of including a neutral control condition in experimental studies of the effect of manipulating motor displays, allowing the effect to be ascribed to the appropriate condition." Manipulations refer to specific assemblages of experimental conditions and a manipulation cannot be said to be validated unless it has been tested alongside both passive and active control conditions and the results reflect Scenarios A or B.

So what should a passive control condition look like? This will be a unique consideration for each manipulation, but the best advice may be just to always include a third condition in which participants skip as much of the manipulation procedures as possible. For instance, Cyberball is a widely-used social exclusion manipulation that most frequently compares an experimental exclusion condition (where participants are left out of a ball-toss) with an active inclusion control condition (where participants receive an equal number of ball-tosses). A passive control condition might just have participants skip the manipulation task completely and immediately begin on whatever procedures follow the ball-tossing task.

Skipping such manipulation procedures may not always be possible (e.g., the Cyberball task is a necessary setup for the subsequent task). In such cases, it's likely best to make the passive condition as 'neutral' as possible. More specifically, experimenters should seek to make the passive control condition replicate the routine, mundane, expected, and run-of-the-mill experiences of everyday life. Doing so allows participants to engage in the psychological processes that are independent of the constraints and demands of experimental manipulations, thus approximating a baseline comparator for the other conditions.


An Argument for Retaining 'Suspicious' Participants in Deception Studies

3/10/2019


Experimental psychologists often deceive their participants about certain parts of their studies. They may provide false cover stories, simulate interactions with other people who are actually computer programs, or give them fake feedback on an intelligence test. Researchers who employ deception also often include a 'post-experimental inquiry', designed to suss out whether participants detected the deceptive elements of the study and/or guessed the study's true purposes. These are often referred to as suspicion probes and can take many forms.

Experimentalists are often trained that suspicion probes and the exclusion of suspicious participants are necessary for doing valid experimental research. Doctoral students read quotes like "It is impossible to overstate the importance of the post-experimental follow-up. . . . the experimenter needs to learn if the deception was effective or if the participant was suspicious in a way that could invalidate the data based on his or her performance in the experiment."

Yet might this approach of identifying and excluding 'suspicious' participants actually cause problems? I argue that it might.

In a survey of 77 social psychologists, over 97% of experimental social psychologists who employ deception also included a suspicion probe. Of these investigators, 84% administered the probe prior to debriefing, which is likely motivated by a desire to assess suspicion before participants are given the full information about the study; 10% administered the probe both before and after the debriefing. The probes took the form of verbal interviews between experimenters and participants (57%), computerized surveys (23%), and/or paper-and-pencil surveys (21%). The number and content of the questions in these interviews/surveys varied wildly, with no standardization across groups. When participants met the researchers' (unstandardized) criteria for being 'suspicious', 58% discarded these participants, whereas 27% included these suspicion ratings as a statistical covariate in their analyses.

As you can see from this survey, suspicion probing is a wild and lawless frontier that enables researcher degrees-of-freedom as investigators make undisclosed and/or unjustified decisions along the garden of forking paths.

As outlined above, suspicion probes are often unstandardized and largely unvalidated measures that are idiosyncratic to a given laboratory or investigator.

Even when experimenters intentionally and covertly delivered deception-related information about the study to participants, these participants didn't reveal that information during a subsequent suspicion probe. And this finding is repeatedly replicated even when participants are rewarded for reporting their suspicions. This suggests that suspicion probes do not reliably detect suspicion even when it does exist and are thus not valid.

Trained experimenters are also unable to reliably distinguish between participants who have and have not received prior information about the deceptive elements of the study, another piece of evidence that suspicion probes are invalid.

Such probes often rely on qualitative responses to experimenter interview questions, which can introduce experimenter biases into the procedure, as the experimenter's subjective interpretation of a qualitative response will impact their ultimate determination of whether a participant was suspicious or not.

Absent a standardized protocol, experimenters can also ask questions in different ways that lead participants towards a given response, which is another form of potential bias. Further, participants may express suspicion due to other motivations. They may be trying to avoid seeming like rubes who were fooled by the experimenters or are looking to cause problems for the people who just put them through a boring and/or distressing experiment.

Without valid and standardized probing procedures, the categorization of participants into 'suspicious' and 'non-suspicious' categories may not reflect a meaningful distinction that maps onto the intended construct. Even if they are valid, probes will always entail some degree of measurement error, which can have negative consequences for trying to accurately and reliably assess suspicion. However, even if such standardization and validation of suspicion probes takes place, it still leaves uncertainty regarding what to do with 'suspicious' participants.

Should 'Suspicious Participants' Be Excluded From Data Analysis?
'Suspicious' participants are often excluded from data analyses, either by casewise deletion or statistical covariance. This practice is motivated by the assumption that suspicious participants provide inaccurate data about the study's hypotheses. But to what extent is this assumption based in fact?

Let's take a look with a specific case-study: studies that simulate real people with computerized avatars. Many studies tell participants that they are interacting with real people on a given task, when in fact they are interacting with pre-programmed computer avatars. Participants are often excluded if they indicate that they were suspicious during the study. However, studies suggest that people interact with avatars that they know are computer programs in similar ways to avatars they believe are actual people. So this is one case where the fears of 'contaminating' the analyses with the inclusion of suspicious participants wouldn't be warranted.

Even if participants are aware of your deception, there isn't any evidence (that I could find) that establishes a systematic way in which *including* such 'suspicious' participants would bias your results. Indeed, demand characteristics and the suspicion thereof (however it is defined) will likely have heterogeneous impacts on your different participants. Some participants may respond by trying to 'help' the experimenters and confirm the predictions they have guessed, some may actively seek to undermine the study, and still others may respond in random or erratic ways having lost faith that this study is a worthwhile endeavor. As such, this source of noise/error in your data should be roughly similar in structure to that introduced by participants exhibiting varying levels of understanding the experimenter's instructions, or actively engaging with and attending to the research task at hand. Investigators rarely screen for and remove participants for these variables, so why be so selective regarding suspicion?

Conversely, *excluding* your suspicious participants is likely to bias your resulting findings in meaningful ways. Recently, researchers have shown that excluding participants who fail attention checks eliminates non-random groups of participants from datasets, which undermines the validity and generalizability of the results. The same goes for suspicion probes, which likely eliminate individuals who tend to be more generally-suspicious, analytic, and intelligent. Thus, excluding 'suspicious' participants trades one problem (including participants who guessed your deception) for another one (eliminating a non-random portion of your sample).

Conclusion
For these reasons, I see merit in retaining your 'suspicious' participants*. Keeping them in our analyses makes our data more noisy but less biased, and removes an important source of measurement error and researcher degrees-of-freedom.

*Within reason. If every participant reveals your hypotheses back to you before you've had a chance to debrief them, your study needs to be re-worked to reduce such suspicion. Thus, suspicion probes may be helpful to pilot and refine our procedures to ensure a suspicion rate below a given threshold (e.g., 5% of participants report suspicion).


Questionable MAnipulation Practices (QMAPs)

12/3/2018


Measuring and manipulating psychological processes is the focus of methodological training in psychological research. Relevant to this, two of my favorite scholars to follow on Twitter (Jessica Kay Flake and Eiko Fried) have recently articulated something that has captured my interest in their 2018 APS talk: Questionable Measurement Practices (QMPs). QMPs reflect choices that investigators make when they seek to measure a psychological construct and can undermine the validity of the assessment of psychological constructs. According to their definition, a QMP is "a measurement decision that lacks justification and/or transparency" (slide 21). Flake and Fried then subdivide QMPs into decisions that are made during the Selection of the measure, the Use of the measure, and the Modification of the measure.

As an experimental psychologist, I see many ways in which this QMP approach can be applied to the practice of experimental manipulation - in which participants' psychological states are systematically influenced (or not) by experimenters, based on random assignment. If questionable practices are present in experimental psychology (and there's good reason to think that they are), then there must be Questionable MAnipulation Practices (QMAPs). Adapting Flake and Fried's definition, QMAPs are decisions relating to the conduct of experimental manipulations that lack an empirical justification or transparency. They may or may not undermine the validity or reliability of a given manipulation, but more often than not, such questionable practices are harmful to psychological science. The 'replicability crisis' in experimental psychology may be due, in part, to undisclosed and questionably-justified procedures regarding experimental manipulations (i.e., QMAPs) that yielded false positives and therefore non-replicable findings.

Below, I discuss QMAPs in the selection, creation, modification, and implementation phases of experimental manipulations in psychology.

Selection & Creation

Once you have identified a psychological state that you seek to manipulate, you have entered the 'selection' phase. Your goal here is to find or create a valid and reliable manipulation to employ to test your hypothesis. Let's say that I have crafted a hypothesis that entails the experimental manipulation of social rejection.

Right off the bat I'm at a crossroads - I can seek out an existing manipulation or create one myself. Without a clear theoretical justification* for one or the other (or without transparently documenting this decision process), I'm already into the tall grass of QMAP territory.

Let's say that I decide to use an existing manipulation because I think it's better to use one with demonstrated efficacy and validity than to go through the fraught process of making one myself. Again, I'm faced with multiple decisions. I can manipulate rejection by:
-leaving participants out of a ball-tossing game
-asking participants to recall an experience of rejection from their past
-telling participants that their personality indicates they'll be alone for the rest of their life
-telling participants that other people chose not to interact with them on a cooperative task
-priming them with images of disapproving faces
-being left out of an online chatroom conversation
-the list goes on and on.....

Which manipulation do I choose? Compounding this problem, a single manipulation is often implemented in many different forms in the published literature. So even after you have chosen a given manipulation, you may have to choose between several different versions thereof. For instance, do I pick the Cyberball rejection manipulation with 2 other players or 3 other players? Do I use the one that includes face images of the partners or just the virtual avatars? So on and so on ......

As Flake and Fried point out, your true guide in this situation is theory. A solid and nuanced theoretical understanding of the construct you seek to manipulate will allow you to identify features that an experimental manipulation must have (and must not have) to be a valid approach to altering your intended construct. So long as your decision is based on sound theory and transparently disclosed, you can avoid engaging in QMAPs at this stage.

However, there may be no manipulation for your intended construct and you will have to create one yourself! The development of a new experimental manipulation entails innumerable decisions (e.g., online or in-person? simulate the experience with confederates or a computer program? deception or no deception?). The best practices for developing new manipulations are detailed elsewhere and are beyond the scope of this piece. My hope is simply to draw awareness to the fact that each of the decisions we make when we develop a new manipulation has the potential to be a QMAP.

Note:
*I fully acknowledge that finding a 'theoretical justification' for a manipulation-relevant decision is not hard to do, especially in the absence of strong theory. Many heuristic-based and intuitive choices can be put forth post hoc as theoretically-justified. This problematic practice has a simple solution --- preregistration.

Modification

Once you've chosen a manipulation, you must then consider whether to implement it exactly as it was detailed in the literature in which you found it, or to modify it in some way. Any deviation from the published description can become a QMAP if not justified and openly-disclosed. Below, I list some hypothetical examples of how modifications to a published experimental manipulation can be made and how they may be justified in ways that stem from practical or other considerations, instead of theory.

Although modifications such as cutting trials may seem like a sound idea for logistical reasons, they aren't justified by a solid theoretical rationale. Without this, we don't know if reducing the number of trials affects the underlying ability of the task to manipulate the construct of interest.

Implementation

Now that you have either obtained or created a suitable manipulation and modified it to your satisfaction, you have to look to its implementation. This is one of the least talked-about and most difficult phases of an experiment - yet if done improperly, it can entirely undermine the validity of your study. Below, I list some hypothetical examples of choices investigators often make in this implementation phase and ways in which they could be justified without theory.

Again, we see that there are many ways in which flexibility is built into the implementation phase. It may seem sensible to drop your suspicious participants, but perhaps your manipulation isn't actually invalidated among suspicious participants (or you didn't measure suspicion well) and you're now just removing a group of participants who happen to be lower in, let's say, agreeableness - thus biasing your study's sample and undermining the validity of your findings. Without a good theoretical rationale or empirical evidence, we just can't say whether it's a good approach or not.

Conclusion

Experimentally manipulating psychological constructs is a crucial means of establishing causal effects. The validity of such manipulations is essential to the credibility of experimental psychology. QMAPs have the potential to undercut this validity. At the end of the day, the best weapons we have against QMAPs are:

-solid and ongoing training in theory, experimental methodology, and construct validation
-open science
-detailed and transparent Methods sections
-preregistration

We should commit ourselves to training the next generation of experimentalists in these areas and practicing them ourselves.


A Renewed Focus on the Validation of Experimental Manipulations in Psychology

11/28/2018


Recently, I've been thinking about Questionable MAnipulation Practices - the choices that experimental psychologists often make about our experimental manipulations that are often unjustified or not portrayed in a transparent fashion. Beyond this, there is another potentially problematic aspect of experimental psychology that continues to ensorcell me: our validation of experimental manipulations of psychological constructs.

The Potential Problems

Experimental psychologists receive remarkable training in how to develop and implement valid experimental manipulations of psychological constructs. Methods classes teach us how to design and execute manipulations that achieve internal validity by focusing on aspects such as avoiding:
-biases in the selection and attrition of participants
-experimenter effects
-demand characteristics and participant suspicion
-confounds between conditions

However, such training sometimes skips over a critical interstitial phase that resides between development and implementation: validation.

We are often taught that if we develop a precise manipulation and implement it flawlessly, it will be valid. But how do we know that it is valid? Here's the answer: if you skip the validation phase, you don't know if (and can't assert that) your manipulation is valid. Without empirical evidence of the validity of your manipulation, there is no scientific basis on which you can assert its validity.

One potential reason that psychology is in a so-called 'replication crisis' is that we didn't spend sufficient resources, time, and energy on this validation phase of our manipulations.

Yet many experimental psychologists might disagree with me --- arguing that experimental manipulations are indeed validated. However, the current approach to the validation of experimental manipulations is undermined by several key issues:

1. An excessive or sole emphasis on face validity.

2. Use of unvalidated manipulation checks.

3. Assuming a manipulation is valid because it had the predicted effect on an outcome of interest.

***If you want more detail on these first three instances and my arguments for why they aren't good validation metrics, check out my previous blogpost on the topic.***

4. Using under-powered (and often unpublished) pilot studies. Many researchers do want to make sure their manipulations 'work' and to determine whether this is the case, they'll run pilot studies. These pilot studies are run-throughs of the manipulation alongside a manipulation check. They're often conducted with small samples in order to reserve greater resources for the real study of interest. The resulting pilot data often remain in laboratory file drawers, away from the scrutiny of peer review. Basing future experimental research off of underpowered pilot samples undermines the validity of experimental psychological findings - and keeping pilot data away from peer review could do the same.

5. Citing papers that did not validate the manipulation. Manipulations are often asserted to be valid by citing a previously-published study that used the given manipulation. However, many of these cited papers did not themselves conduct a thorough validation process and therefore this citation approach leverages the assumed credibility of the published literature to imbue manipulations with unfounded validity. This practice is not unique to manipulation validation. For instance, state-level measures of affect often cite previous papers that failed to present validation evidence.

6. Altering manipulations without first validating the modified version. If an investigator wants to create a shorter version of a questionnaire, they must first go through a lengthy validation process of this new version of the scale. However, researchers can often modify an experimental manipulation (e.g., by swapping out stimuli, altering the instructions) without then having to re-validate this new version of the manipulation.

7. Applying a manipulation to a new context or population without first validating this new application. If a researcher wants to use a questionnaire in a new country, they must painstakingly translate the scale and then validate it in this new cultural environment. If they want to apply a measure developed for young adults to a population of older adults, they must first demonstrate similar psychometric properties with this older population. This same standard is not upheld with experimental manipulations, which are often administered to different populations and in widely-varying contexts (e.g., an online version of an in-lab manipulation) without first ensuring that the underlying validity of the manipulation is upheld.

8. Interacting 2+ manipulations (e.g., in a 2 x 2 factorial design) without first examining whether one invalidates the other(s). Context matters; that's the whole reason we do experimental manipulations in the first place. So it stands to reason that an experimental manipulation may alter the meaning, efficacy, and therefore the validity of a subsequent manipulation. For instance, a recently-rejected participant may not pay as much attention to the instructions of a subsequent, complex manipulation, which may create experimental artifacts.

9. Testing the effect of a manipulation without knowing its typical duration. We often speak of manipulations as having 'an effect'. This is a fallacy as the effect of a manipulation varies as a function of time since its implementation. No manipulation's effect is infinite or unchanging. The strength of any manipulation will rise and fall, but without knowing the timecourse of the effect, it's questionable to say at any given timepoint in your study what effect you expect the manipulation to have.

10. Using 'boosters' without validating them first. Investigators sometimes seek to re-animate the effects of a manipulation by:
-re-administering an abbreviated version of the manipulation
-reminding participants of the manipulation
If boosters are used without having been validated beforehand, they might have unintended effects. A repeated administration or a reminder of the same manipulation may have a qualitatively different effect on participants than the initial manipulation. In such cases, the 'boosters' aren't boosting the original manipulation, but changing the meaning of it entirely.

11. When deception is employed, failing to estimate suspicion attrition rates. Experimental manipulations in psychology often involve deception of participants to avoid drawing participants' focus towards the study's hypotheses or to simulate an experience that isn't real. Papers are often all over the map on how they assess the number of participants who were suspicious of their deception (i.e., the suspicion attrition rate) or whether they assessed it at all. Without knowing how many participants disbelieved your deception, it remains uncertain whether your procedures induced the desired psychological state in your participants.

So, how does one fix this situation and effectively validate an experimental manipulation?

The Potential Solutions

To solve these problems we should discontinue the practices outlined above. But what then to put in their place? Fortuitously for experimental psychologists, we need not wander uncharted territories to learn how to validate our manipulations. We can just model our revolution on the validation procedures used for questionnaires.

1. Validate experimental manipulations prior to implementation (not alongside).
Researchers who seek to develop new self-report questionnaires or clinical assessments must often conduct (and publish) several validation studies before the scientific community adopts their measure. These efforts are summarized in 'validation papers' and are published in reputable journals dedicated to these projects (e.g., Assessment).

I argue that investigators should include and publish the results of this validation phase for new experimental manipulations. Instead of immediately applying a new manipulation to a focal hypothesis (e.g., social rejection increases political conservatism), several well-powered studies would first need to be run and scrutinized in order to estimate whether your manipulation exhibits empirical properties of construct validity (e.g., that your manipulation actually induces feelings of social rejection). These studies would then need to be subjected to the scrutiny of the peer review process.
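
To give a rough sense of what 'well-powered' can demand of such validation studies, here is a minimal sketch that uses the textbook normal-approximation formula for comparing two conditions. The function name `n_per_group` and the defaults (two-tailed alpha of .05, 80% power) are illustrative assumptions on my part, and the approximation runs slightly below what an exact t-test calculation would give.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants needed per condition to detect Cohen's d.

    Normal approximation: n = 2 * (z_(1 - alpha/2) + z_power)^2 / d^2.
    Hypothetical helper for illustration only.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-tailed criterion
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# Conventional 'small', 'medium', and 'large' effect sizes
for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))  # 393, 63, and 25 per condition, respectively
```

Even under this optimistic approximation, a 'small' manipulation effect calls for several hundred participants per condition, which is a long way from the typical pilot sample.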

Just as new questionnaires are published alongside an appendix that details the new measure, such manipulation validation studies would include a detailed step-by-step protocol of exactly how to implement the experimental procedure, including (but not limited to):
-how to randomize participants into each condition
-specifications about what testing environments should be used
-details about how testing rooms should be arranged
-scripts detailing exactly what experimenters should say to participants and when
-how experimenters/participants are blinded to condition
-training protocols for new research assistants
-quality checks to ensure that the study is being run appropriately
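
As one example of how a protocol item could be pinned down precisely enough for another lab to reproduce, here is a minimal sketch of seeded block randomization. The condition labels, the seed value, and the function name `block_randomize` are hypothetical; a real protocol would state its own.

```python
import random


def block_randomize(n_participants, conditions=("rejection", "acceptance"), seed=2018):
    """Assign participants to conditions in shuffled blocks.

    Each block contains every condition once, so group sizes never
    differ by more than one. The fixed seed makes the allocation
    sequence reproducible. All names and defaults are illustrative.
    """
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_participants:
        block = list(conditions)
        rng.shuffle(block)  # random order within each block
        assignments.extend(block)
    return assignments[:n_participants]


schedule = block_randomize(7)
print(schedule)
```

Publishing the seed and the blocking rule alongside the protocol would let reviewers regenerate the exact allocation sequence.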

These protocols would be subject to the peer review process and could be posted publicly alongside any and all documentation, stimuli, and software code, in order to ensure that other labs can replicate the exact procedures. Other labs who seek to use the manipulation in their own work *must* then agree to adhere to the exact procedures outlined in the protocol, without deviation, in order to claim their use of the manipulation was valid. Badges could even be given to papers that demonstrated strict adherence to these standardized manipulation protocols.

This approach may entail a cultural shift towards a slower science. We are often eager to get right to the questions we want to answer, but this approach would require us to interject a laborious process between us and our desired hypothesis test. Taking the time to first stress-test our manipulations may be a frustrating-yet-necessary step towards increasing the credibility of experimental psychology.

This validate-then-implement approach would also potentially allow experimentalists to avoid the problems associated with employing manipulation checks in studies (e.g., drawing participants' attention to the true purpose of the manipulation). Once a manipulation has been validated, you could administer it without the manipulation checks - just as using a validated questionnaire doesn't then require that you also acquire an array of construct validation checks for that questionnaire.

Yet how do you demonstrate that your manipulation is valid in the first place? I turn to that next.

2. Map the nomological ripple of your manipulation.
The constructs we seek to experimentally manipulate exist in a nomological network with other constructs. To say that our manipulation is valid, we must observe that it manipulates our construct of interest. However, by virtue of its nomological ties to other constructs, our manipulation will almost always simultaneously manipulate constructs that are closely related to the one we seek to manipulate. Still, if we have constructed a valid manipulation, then its effect should be strongest for the intended construct and progressively weaker on constructs that occupy related theoretical space. I refer to this diffusion of the manipulation effect through the nomological network as a nomological ripple and propose that it can be implemented as an empirical means of establishing the validity of your manipulation.

Manipulation validation studies could begin by articulating and depicting an abbreviated form of the nomological network around the construct they seek to manipulate (sample network depicted below). In this map, the core construct is depicted in the middle (Rejection), and nomologically-related constructs could be depicted in concentric rings, with farther rings representing smaller expected ties to the core construct.

Constructs can be placed in relation to the core along a continuous radial gradient that reflects the strength of their correlation with the core construct (this can be obtained from existing literature). Stronger correlations (negative or positive) are placed closer to the core and weak to null correlations are set in the periphery. Each construct must be quantified by a validated measure thereof.

This nomological ripple can then be estimated by measuring each of the constructs in the network after the manipulation (in counterbalanced order) and plotting the corresponding effect size estimates (e.g., Cohen's d), as below [asterisks denote hypothetical statistical significance at p < .05].

A valid manipulation should elicit the largest effect upon the core construct, with diminishing effect sizes at farther distances from the core. If your manipulation had a stronger effect on a construct outside the core (e.g., Anger), then your manipulation could be better characterized as a manipulation of that construct (e.g., an Anger Manipulation) than that of the intended target.

Ideally, the confidence intervals (or whatever estimate of uncertainty you deem wise) surrounding the manipulation's effect on the core construct should not overlap with any of the other effect intervals - as this will allow you to infer that your manipulation has a (somewhat) meaningfully larger effect on the core than on the distributed network.
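
The comparison described above can be sketched in a few lines. This is a minimal illustration, assuming every measure is scored so that the expected effect is positive; it uses the pooled-SD form of Cohen's d with a large-sample standard error. The function names, the 'Sadness' construct, and all numbers in the usage example are made up for demonstration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d_ci(treatment, control, z=1.96):
    """Cohen's d (pooled SD) with an approximate 95% confidence interval."""
    n1, n2 = len(treatment), len(control)
    pooled_sd = sqrt(
        ((n1 - 1) * variance(treatment) + (n2 - 1) * variance(control))
        / (n1 + n2 - 2)
    )
    d = (mean(treatment) - mean(control)) / pooled_sd
    # Large-sample standard error of d
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d, d - z * se, d + z * se


def ripple_check(effects, core):
    """Compare each construct's effect against the core construct's effect.

    effects maps construct name -> (d, ci_low, ci_high).
    """
    core_d, core_low, _ = effects[core]
    report = {}
    for name, (d, _, high) in effects.items():
        if name == core:
            continue
        report[name] = {
            "smaller_than_core": d < core_d,
            "ci_separate": high < core_low,  # no overlap with the core's interval
        }
    return report


# Hypothetical effect estimates, for illustration only
effects = {
    "Rejection": (0.90, 0.60, 1.20),
    "Sadness": (0.50, 0.20, 0.80),
    "Anger": (0.30, 0.05, 0.55),
}
print(ripple_check(effects, core="Rejection"))
```

In this made-up case, Anger's interval sits entirely below the core's, whereas Sadness overlaps with it, so the ripple would be only partially supported.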

Assuming you demonstrate an acceptable nomological ripple, the next question to answer is: how long does my manipulation effect last?

3. Plot the timecourse of your manipulation.
Test-retest reliability doesn't really apply to manipulations in the same way that it applies to trait questionnaires, but time is still an important factor in determining the validity of a given experimental manipulation. The duration of your manipulation effect, and its effect size at each timepoint, are crucial data to obtain in order to ensure that dependent measures fall within the 'active' portion of the manipulation effect's timecourse (i.e., when the manipulation is exerting its intended effect). Further, experimenters can identify temporal 'sweetspots' where the effect is strongest or most stable. We assume that the effect is strongest immediately and maintains stability for some fleeting period of time, but this may not always (or even often) be the case. Indeed, some psychological reactions take time to build (e.g., the growing horror that you got your socks wet in the morning and will now have wet socks for the entire day). Establishing the timecourse of a manipulation's effect allows us the precision to say how long the effect lasts, how stable it is over time, and how strong it is at given timepoints in our study.

To do so, investigators should provide baseline assessments of the target construct, administer the manipulation, and then repeat such measurements as rapidly as is logistically possible (assuming that the validity of the construct's measure is not undermined by fast and iterative assessment). The manipulation's effect size on each measure should then be depicted across these measurements and the temporal quirks should be described. Does the manipulation elicit a 'quick burn' (i.e., an immediate effect that decays quickly), as depicted below?

Or does your manipulation exhibit a 'slow burn' (i.e., a temporally-stable effect), as depicted below?

These depictions of your effect are crucial evidence for the temporal boundaries of your manipulation's construct validity, and for other reasons as well.
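
A timecourse like this can also be summarized numerically. The sketch below is a minimal illustration that takes effect sizes already estimated at each post-manipulation timepoint and reports the peak and the first and last measured timepoints at which the effect stays at or above a floor. The floor of d = 0.2, the function name, and the example values are arbitrary assumptions, not established cutoffs.

```python
def summarize_timecourse(minutes, effect_sizes, floor=0.2):
    """Summarize when a manipulation's effect peaks and when it is 'active'.

    minutes: measurement times (minutes since the manipulation).
    effect_sizes: effect size (e.g., Cohen's d) at each time.
    floor: arbitrary threshold below which the effect counts as inactive.
    """
    peak_index = max(range(len(effect_sizes)), key=lambda i: effect_sizes[i])
    active = [t for t, d in zip(minutes, effect_sizes) if d >= floor]
    return {
        "peak_minute": minutes[peak_index],
        "peak_d": effect_sizes[peak_index],
        "active_from": min(active) if active else None,
        "active_until": max(active) if active else None,
    }


minutes = [0, 5, 10, 15, 20]
quick_burn = summarize_timecourse(minutes, [0.9, 0.5, 0.2, 0.1, 0.0])
slow_burn = summarize_timecourse(minutes, [0.6, 0.6, 0.55, 0.5, 0.5])
building = summarize_timecourse(minutes, [0.1, 0.3, 0.6, 0.5, 0.3])
print(quick_burn["active_until"], slow_burn["active_until"], building["peak_minute"])
```

With these invented values, the quick burn is last active at 10 minutes, the slow burn is still active at the final 20-minute measurement, and the building effect does not peak until 10 minutes in.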

Depending on the construct you seek to manipulate, the timecourse of your manipulation may be critical information for research ethics boards. If there are concerns about participants leaving the lab in a potentially harmful or dangerous state, a carefully-estimated timecourse may allow you to predict the duration that participants will need to return to baseline.

This also means that if you seek to employ boosters to maintain the effects of an experimental manipulation over time, you should validate any boosters you employ after your manipulation and establish their individual timecourses.

4. When deception is involved, estimate the suspicion attrition rate (SAR).
The construction of valid deception experiments emphasizes the need to minimize demand characteristics and subsequently, participant suspicion of your deception. However, such suspicion is not always empirically estimated and is certainly not estimated in a systematic and validated manner across laboratories.

To do so, investigators could administer a standardized suspicion probe (please someone make this, it would be one of the most cited instruments in psychology!) and report the rate of participants who express varying levels of disbelief in their deception (as depicted below).

This data would allow researchers to optimize their manipulation to minimize suspicion. Further, creators of a new manipulation could suggest exclusion criteria based on the results of the suspicion probe (e.g., "Participants who report total disbelief or some doubts should be excluded from all analyses"). Such an exclusionary-cutoff could be based on data showing that the manipulation no longer exhibits construct validity at specific suspicion cutoffs.
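
As a minimal sketch of what reporting an SAR could look like, the code below tallies responses to a suspicion probe and attaches a Wilson score interval to the overall rate. No standardized probe exists yet, so the response labels ('total belief', 'some doubts', 'total disbelief'), the choice of which levels count as suspicious, and the sample data are all hypothetical.

```python
from collections import Counter
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (approximate 95% by default)."""
    p = successes / n
    denominator = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denominator
    half_width = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denominator
    return center - half_width, center + half_width


def suspicion_attrition_rate(responses, suspicious=("some doubts", "total disbelief")):
    """Share of participants whose probe response counts as suspicious."""
    counts = Counter(responses)
    n = len(responses)
    n_suspicious = sum(counts[level] for level in suspicious)
    return {
        "n": n,
        "by_level": dict(counts),
        "rate": n_suspicious / n,
        "ci": wilson_interval(n_suspicious, n),
    }


# Made-up probe responses from 100 participants
responses = ["total belief"] * 90 + ["some doubts"] * 7 + ["total disbelief"] * 3
print(suspicion_attrition_rate(responses))
```

Reporting the breakdown by level alongside the overall rate keeps the exclusion rule itself open to scrutiny, since readers can see how the SAR would change under a stricter or looser cutoff.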

Further, investigators could pair these suspicion assessments with individual difference measures in order to examine whether those who are suspicious of their deception tend to exhibit specific attributes. Such information could help researchers understand if their manipulation incidentally induces suspicion among certain types of people (resulting in their potential exclusion from the sample), which could serve to undermine the validity of the manipulation.

Conclusion

Current manipulation validation practices aren't held to the same standard we apply to psychological measures, and this may have big costs for experimental psychology. I recommend some potentially controversial changes to the manipulation validation process that might improve the evidentiary value, replicability, and credibility of experimental psychology. Switching to a validation-first, implementation-later paradigm has already clearly worked for those developing psychological questionnaires, tests, and assessments. The results of such validation procedures can be published and presented on their own and such papers are often highly-cited. Further, using a truly valid manipulation will increase the replicability of your own work and lead to more credible bases for our understanding.

Post-Script

Over the course of thinking about this, I've noticed that what constitutes an experimental manipulation is far from clear. Check out the results of this Twitter poll I ran:

So perhaps the suggestions I provide above don't apply equally across all potential manipulations. If you're just presenting an array of measures, even if this meets strict experimental manipulation criteria, you probably don't need to validate this manipulation before you use it. But, if your manipulation is intended to induce a psychological state in your participants, then I think it must be validated before use in testing other hypotheses.


Validating a Standardized Approach to the Taylor Aggression Paradigm

4/22/2018


My lab recently published a paper in Social Psychological and Personality Science, which I encourage you to read for all the details, but I'd also like to talk a bit about it here. The page limits of the article did not allow me to go into as much depth as I wanted. In this paper, we attempted to validate a standardized version of the Taylor Aggression Paradigm.

What the Hell is the Taylor Aggression Paradigm?

Psychology has long agreed that aggression is an important phenomenon to study, but it's proven very tricky to do so. The majority of the literature has consisted of self- and peer-reports of aggression in survey research. However, to determine the causal forces behind human aggression, we needed to bring these assessments into the laboratory and detach them from the biases inherent in such reports.

In the 1950s and 60s, the scholars Epstein and Taylor developed a paradigm in which participants arrived at the laboratory, were set against an opponent in a competitive task, and were allowed to inflict harm on that opponent as part of the task (e.g., shock them with varying amounts of electricity). This general model became the Taylor Aggression Paradigm (hereafter TAP), which has enjoyed wide popularity and substantial modification.

The TAP has proven to be a target of controversy for lots of reasons, many of which center on the debate regarding violent media and aggression, which I give a wide fucking berth (not my monkey, not my circus). Others have also provided evidence for the TAP's internal and external validity.

The TAP is often the butt of jokes, where others mock the use of noise blasts (or pins stuck in a voodoo doll, or hot sauce dumped on saltines, or dunking a hand under ice-water) as ludicrous operationalizations of harm-doing. Given the ethical concerns of the lab, these forms of aggression are about as much as we're allowed to do. I'd love to find a harmless, ethical way to host a scientifically-valid Psychology Fight Club (see below), but it's just not feasible. If anyone has a better way of measuring aggression in the lab, I am all ears. Because aggression is so costly, it's incredibly important that we get this right, that we measure aggression in an accurate, reliable, and replicable way.

Combating Flexibility Issues with Preregistration

Like almost all other psychometric paradigms, the TAP can be implemented, scored, and analyzed in a flexible manner. This can be a blessing, allowing researchers to tailor the task to a given experimental setting or hypothesis, but can also be a curse, when researchers can misuse this flexibility to achieve illusory support for hypotheses by capitalizing on chance as they test a given prediction across numerous scoring and analytic regimes. This misuse of the TAP appears to be rampant. I'm not saying that this is the case for any given body of work or scholar, I'll leave that determination to you, my esteemed Reader.

The TAP has emerged as the primary target of this debate about the role of flexibility in undermining sound science, but before you throw your personal pitchfork at the task, ask yourself whether a psychometric instrument you use could pass the same bar. Any brief survey of the literature will show you that many of the most popular tasks (e.g., Stroop, Go/No-Go, Mind In The Eyes) are implemented, scored, and analyzed in highly flexible ways. This flexibility often applies to questionnaires too (e.g., subscales vs. total scores, long-form vs. short-form, retain vs. drop items that bring alpha below .70).

So what is one to do? Simple: preregister your hypotheses, implementation plan, scoring plan, and analytic strategy, and stick to them as best you can. That's what we did across 2 studies. I encourage you to check out our preregistration plans for Study 1 and Study 2, and our data/code/materials.

Findings

We used a computerized version of the TAP that measured aggression as the volume and duration with which participants decided to administer blasts of a very uncomfortable noise (think of a cat getting sucked into a jet turbine) across 25 trials. Across the task, their opponent (a computer program) initially and then repeatedly provoked them by selecting loud and long noise blasts to administer.

Across both studies, we averaged scores across all trials of the TAP, driven by the logic that a greater number of measurements will yield a more accurate and reliable estimate of aggression.

We found that louder noise blasts on the TAP corresponded to greater aggression on two other canonical aggression measures: the Voodoo Doll Aggression Task and the Hot Sauce Aggression Task (see figure below), as well as a self-report measure of trait physical aggression. Thus, the task exhibits convergent validity with other aggression measures, which is great! However, showing that one 'contrived' laboratory aggression measure (as described by several of my reviewers) corresponds to two other ones, isn't enough evidence to claim that the task is a solid aggression measure.

If you want to evoke aggression in someone else, what would you do? Many of you would probably say 'insult them' and that is exactly what we did to our participants. We told half of them that an essay they wrote was garbage, a great way to evoke aggression from students, and ethical enough to use in the lab. As we predicted, doing so increased aggressive behavior on the TAP, providing evidence for the task's construct validity. <<Props to the R package vioplot for the figure below>>

One of the most debated topics around the TAP is its external/predictive validity. Can the TAP predict who is aggressive in the real-world? To try and determine whether this was the case, we asked people how many physical fights they had been in across varying time spans. We got some seriously mixed results, with TAP scores being associated with greater physical fight frequency over the past year and 'ever', but not over the past 5 years. I'm not really sure what to make of these mixed results, so I'm just going to leave the external validity of the task as 'currently unknown' in my book. I don't think that my measure was an ideal assessment of real-world violence (e.g., being in a fight doesn't mean you started it), so I want to more rigorously approach this issue in the future.

To assess the task's discriminant validity, I wanted to identify variables that were similar to aggression but conceptually distinct, to ensure that the TAP was capturing aggression and not something else. First, I chose *verbal* aggression, as the TAP is a physical aggression measure and therefore shouldn't also capture other forms of aggression. Second, I chose self-harm, as the TAP should capture the tendency to harm others and not the self. Across both studies, there were weak and marginal associations between scores on the TAP and these two variables. This could mean either that the TAP does not exhibit good discriminant validity or, even more likely, that I picked 2 imperfect variables for this purpose. Both of them correlate with physical aggression to a reliable degree. In the future, the discriminant validity of the TAP needs to be investigated with variables that are both conceptually distinct from and uncorrelated with physical aggression.

Tasks with multiple assessments also need to be internally consistent, or else the lack of reliability undermines any inferences gleaned from the task. Principal components analysis showed that the 50 TAP measurements largely loaded onto a single component (see scree plot below), which suggests that the aggregate scoring approach we took was appropriate (though see below for reasons why that may not be the case).
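For anyone curious what the scree-plot logic looks like in code, here is a minimal sketch in Python (not the analysis code from the paper). It simulates hypothetical data in which 50 measurements all share one common source, then pulls the eigenvalues of their correlation matrix, which are the values a scree plot displays. The function name `scree_eigenvalues` and every simulated number are illustrative assumptions.

```python
import numpy as np

def scree_eigenvalues(data):
    """Eigenvalues of the correlation matrix, largest first.
    Rows are participants, columns are measurements."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    # eigvalsh returns eigenvalues in ascending order, so reverse them.
    return np.linalg.eigvalsh(corr)[::-1]

# Hypothetical data: 300 participants, 50 measurements driven by one shared factor.
rng = np.random.default_rng(0)
n_participants, n_measurements = 300, 50
common = rng.normal(size=(n_participants, 1))
noise = rng.normal(size=(n_participants, n_measurements))
simulated = 0.8 * common + 0.6 * noise

eigvals = scree_eigenvalues(simulated)
# Proportion of total variance captured by the first component.
first_share = eigvals[0] / eigvals.sum()
```

When one component dominates like this (a big first eigenvalue followed by a steep drop), averaging across measurements is easier to defend. A dominant first component doesn't guarantee that a single-factor model will fit well, though.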

TAP scores were no different between self-identified males and females, which were admittedly not sampled equally. This was very surprising, given the well-established higher rates of physical violence among males. However, the effect of gender on aggression is not as simple as we once thought, and these findings may reflect this complexity (or not).

Exploratory Analyses

These datasets allowed us to examine some other, un-preregistered questions, many of which were suggested by our incredible editor and reviewers.

We used structural equation modeling to examine whether the aggregate scoring approach showed good fit to the data. Sadly, modeling a single latent factor that each of the TAP's 50 measurements loaded onto showed pretty crappy model fit. We tried to improve things by modeling the first 2 measurements as a separate factor, as several aggression researchers have told me that the first trial of the TAP is a 'clean' measure of 'unprovoked' aggression as it precedes the provocation that is often built into the first exchange between participants and their opponents on the task. Doing so didn't help. Neither did modeling the second trial as its own factor (as some say that Trial 2 is a 'clean' measure of 'provoked' aggression, as it immediately follows the opponent's pre-programmed provocation). Divvying up the measurements into a volume and a duration latent factor showed the best, though still poor, model fit. Perhaps this suggests that we should separately model volume and duration settings from the task, but the fact that they correlate at r = .93 tells me that the results between the two factors won't meaningfully differ.

We also used SEM to examine the taxonomy of aggression measures. First, we examined whether our self-reported and behaviorally-assessed measures of aggression loaded onto a single aggression component (left figure panel below). This approach didn't fit the data well, so we tried it again modeling self-reported and behavioral aggression as separate, though correlated, factors. This model fit the data much better (right figure panel below).

Because the self-reported aggression measure was also a trait aggression measure, it's impossible to know whether these findings mean that self-reports and behavioral assessments of aggression are meaningfully different, or if it simply comes down to state vs. trait. However, the model does show that TAP scores had the highest loading onto the behavioral aggression factor, followed closely by pin counts from the Voodoo Doll Aggression Task, and far behind was the Hot Sauce Aggression Task. Though exploratory, these findings might suggest that the TAP is a superior aggression measure to the other two.

We also examined the presence of curvilinear effects of various variables on TAP scores (see below), finding two. As TAP scores increased, the association with voodoo doll pin counts became negative and the association with hot sauce allocations became more positive. These conflicting data patterns are hard to interpret, but might simply reflect that the TAP scores become less accurate at their extremes. Indeed, participants who set the volume and duration at the maximal value on every trial may not be taking the task seriously or may be trying to 'troll' the researchers. We should perhaps be cautious of findings obtained only using the 'extreme aggression' scores from the TAP. However, curvilinear effects are notoriously difficult to detect in non-large samples, so perhaps these finicky results are due to that simple statistical issue.
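One common way to probe a curvilinear effect is to add a squared term to a regression. Here is a minimal sketch of that idea in Python. It is not necessarily the exact analysis we ran. The function name `quadratic_fit` and the noise-free toy data are hypothetical.

```python
import numpy as np

def quadratic_fit(x, y):
    """Least-squares fit of y on centered x and centered x squared.
    Returns [intercept, linear, quadratic]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centering reduces overlap between the linear and squared terms
    design = np.column_stack([np.ones_like(xc), xc, xc ** 2])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

# Hypothetical, noise-free data with a known negative curve (not from the study).
x = np.linspace(0, 10, 51)
y = 2.0 + 1.0 * (x - 5) - 0.2 * (x - 5) ** 2

intercept, linear, quadratic = quadratic_fit(x, y)
# The slope at any point is linear + 2 * quadratic * (x - mean of x),
# so a negative quadratic term means the slope shrinks, and can flip sign, as x increases.
```

With real data there is noise around the curve, and the squared term's estimate is typically much less stable than the linear one, which is part of why these effects are so hard to pin down without large samples.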

Conclusions and Future Directions

First things first, this was a flawed and preliminary first-go. The manuscript details ways in which things went wrong during data collection and ways in which we deviated from the preregistration plan. Ideally, I'd do the whole thing again without any deviations from the preregistration or errors. These aside, I think our findings offer cautious optimism for the use of the TAP (assuming that its implementation, scoring, and analysis are preregistered). The TAP seems to agree with other measures of the same thing and to show reliable reactivity to provocation. The evidence for the TAP's external and discriminant validity is mixed, but this is likely due to poor psychometric choices on behalf of yours truly that I've detailed above. Overall, the evidence supports the use of an aggregate scoring approach to the TAP. It is completely uncertain how other scoring approaches might fare, though you should check out this preprint that may offer some insight.

Like almost any psychological measure, the TAP, with its inherent flexibility, can be misused. That doesn't mean the task itself is flawed. You don't blame the car when an absent-minded driver flips over a highway median, and in that way you shouldn't blame the TAP for operator error. In this project, we tried to take such error off the table by preregistering our practices (with mixed success). In doing so, I think we have shown some imperfect, preliminary evidence that this approach to the TAP itself is alright, which places the mantle of responsibility at the feet of the investigators, where it should be. Now it's our job to respect this tool and do things the right way.


Validating Experimental Manipulations

2/4/2018


I've been thinking and talking quite a bit about the often-ignored practice of empirically-validating the experimental manipulations that psychologists use, so I decided to blog about it here. Am I qualified to do so? Not to the extent that this topic deserves. In graduate school, I took an excellent Personality Psychology course with Dr. Suzanne Segerstrom, who made construct validation accessible and a priority to understand. I've started to do my own validation research, examining the potential validity of the Taylor Aggression Paradigm, and I regularly supplement my own ignorance by collaborating with clinical and personality psychologists who are more expert in this area. With that in mind, here we go unto the breach.

Experimental psychologists (like myself) often develop experimental manipulations intended to cause changes in individuals' thoughts, feelings, and behavior. However, these manipulations are often considered 'valid' based on a few criteria:
-face validity (does it appear to manipulate what it intends to manipulate?)
-manipulation checks
-whether it achieved its predicted outcome

Let me give you an example. My lab often uses an essay-feedback paradigm to provoke participants into acting aggressively. In this manipulation, participants get very nasty or very nice feedback on an essay they just wrote. We say that this is valid because:
-it looks like a valid way to increase aggression (e.g., participants fume and scoff when they get harsh feedback [see Jon Stewart below for a depiction of such responses])
-it increases scores on a manipulation check questionnaire, in which participants self-report how provoked they felt by the feedback
-it increases aggressive behavior in the lab

But is that really enough to say I've validated my experimental manipulation? I don't think so. Here's why:
1. We can't always trust our intuitions about face validity. Just because something appears to have certain properties doesn't mean it always does. Our personal biases make us see things inaccurately, and we may see that a manipulation is face valid because we really, really want it to be.
2. Manipulation checks are a good idea when it comes to validating experimental manipulations. However, the manipulation check measures are often not validated themselves, beyond appearing to measure the construct that they measure (see previous statement about issues with face validity).
3. Saying an experimental manipulation is valid because it had the intended effect on your outcome is the same as saying 'it works because it worked' and is tautological. There are many other reasons it could've had the desired effect that have nothing to do with your hypotheses (e.g., the deceptive elements of your manipulation may have been laughably transparent, putting participants in a humorous mood state).

So, what can we do about this? When I tear down, I also like to build up, so I make some suggestions below that came from conversations with and readings of personality and clinical psychologists who have dedicated many decades to psychometric validation. I was implicitly aware of these issues over my short career, but when it came to applying better validation techniques:

A few things we can do to promote the validation of experimental manipulations:

1. Use (and create!) validated manipulation checks and outcome measures.
A great deal of psychometric validation work has focused on personality trait questionnaires and clinical assessment tools. However, there is a dearth of validation work being done on questionnaires that measure state-level processes and on behavioral measures. We need to spend more time developing, systematizing, and validating these measures as a foundation on which to build validated experimental manipulations.

Caveat: The inimitable Dr. Sanjay Srivastava pointed out that this creates a problematic loop, in which we need to validate state measures, which would require validating them with validated experimental manipulations, which would require the use of validated state measures to validate the manipulations, so on and so on.

2. Identify the nomological network around your manipulation.
Lee Cronbach and Paul Meehl coined the term 'nomological network' to represent the constellation of variables (latent and observed) that orbit around a given construct, and the relationships they share. This approach is useful here because it asks you to go beyond the manipulation check and to consider the effect of a given manipulation on other variables to best triangulate its actual construct validity.

In practical terms, this means that we should examine the effect of a manipulation not just on our manipulation check and outcome of interest, but also on theoretically-related variables that the manipulation should influence. We should also examine the effect of the manipulation on variables it is not supposed to affect.

Going back to our example with the essay feedback paradigm, I should test whether the manipulation increases feelings of provocation (the manipulation check) and aggression (the outcome of interest), but also whether it increases states that are theoretically, positively linked to aggression (e.g., anger, approach motivation, hostility), decreases those that are negatively linked to aggression (e.g., empathic concern, inhibition), and has no effect on states that do not relate to aggression (e.g., mating preferences, working memory). That last set of variables is particularly challenging for me to identify.
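Here is a rough sketch, in Python, of what checking a manipulation against its predicted network could look like. Everything in it is hypothetical: the function names, the toy scores, and especially the 0.2 cutoff for calling an effect 'absent', which is an arbitrary placeholder and not a recommendation.

```python
import numpy as np

def cohens_d(treatment, control):
    """Standardized mean difference using the pooled standard deviation."""
    t = np.asarray(treatment, dtype=float)
    c = np.asarray(control, dtype=float)
    pooled_var = ((len(t) - 1) * t.var(ddof=1) + (len(c) - 1) * c.var(ddof=1)) / (len(t) + len(c) - 2)
    return (t.mean() - c.mean()) / np.sqrt(pooled_var)

def check_network(data, predictions, null_bound=0.2):
    """Compare each observed effect with its predicted direction.
    predictions: +1 (should increase), -1 (should decrease), 0 (should not change).
    null_bound is an assumed, arbitrary cutoff for treating an effect as absent."""
    results = {}
    for name, expected in predictions.items():
        d = cohens_d(*data[name])
        if expected == 0:
            ok = abs(d) < null_bound
        else:
            ok = np.sign(d) == expected and abs(d) >= null_bound
        results[name] = (round(float(d), 2), bool(ok))
    return results

# Hypothetical scores: (harsh feedback condition, nice feedback condition).
data = {
    "anger": ([5, 6, 7], [2, 3, 4]),
    "empathic_concern": ([2, 3, 4], [5, 6, 7]),
    "working_memory": ([4, 5, 6], [4, 5, 6]),
}
predictions = {"anger": 1, "empathic_concern": -1, "working_memory": 0}
results = check_network(data, predictions)
```

A small observed effect is not the same thing as evidence of no effect, so a real version of the 'should not change' check would call for something like equivalence testing with a justified bound and an adequately powered sample.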

If you're looking to graphically map your given theory's constructs and the relations between them, Dr. Kurt Gray's Theory Mapping website is an awesome tool for this.

3. Train graduate students in experimental psychology in validation techniques.
I don't have any data on this, but my sense of experimental psychology graduate training is that psychometric validation is not a core feature. It's a focus of clinical and personality psychology programs, but experimental psychologists may sometimes fail to emphasize training in these areas and give more focus to things like developing realistic and deceptive manipulations (which is, of course, also very important). There is a cost to this. Indeed, part of our replication crisis may be due to the use of unvalidated experimental manipulations. Our graduate Methods courses could combat this by including training in how to validate measures and manipulations.

4. Set aside time, prestige, and journal pages for studies that purely focus on validating a manipulation.
It can be frustrating to set aside time to validate experimental manipulations, instead of simply using them immediately in a study and relying on manipulation checks. However, if we can properly incentivize such work, then it will not require trading off 'substantive' projects against rigorous validation work. As examples of how this is going, editors from the journals Assessment, Advances in Methods and Practices in Psychological Science, and the Journal of Research in Personality have all welcomed such submissions (see below).

A couple of final thoughts:

We often assume that our manipulations (essay feedback) exert their effects on our outcomes-of-interest (increased aggression) because they operate on a specific mechanism (increased provocation). However, we need to test these assumptions, which can be done by experimentally manipulating the proposed mediator (which, of course, presupposes the existence of a validated manipulation of the mediator) and would likely require massive samples.

A lingering question I have is 'do we need to re-validate manipulations when we use them in conjunction with other manipulations?' If I combine a rejection manipulation with another one that, say, induces feelings of empathy, do I need to re-validate the rejection manipulation to ensure that it still does its job, even in the presence of experimentally-increased empathy?

In closing, I realize that this is yet another blog post that suggests we fix something in experimental psychology that will be very difficult to fix and opens up its own host of methodological questions and issues. Further, I did this by pointing out a bunch of ideas that are not mine or new by any means. Further(er), I am guilty of publishing experimental manipulations without validating them, and it's likely I will publish more papers that fail to do so. Even so, I am hopeful that if I/we put a bit more time and energy into validating our experimental manipulations, our inferences will improve, and with them our understanding.


First Post, First Twitter Controversy

11/26/2017


I was avoiding blogging, but decided to start for two reasons:
-A friend told me that blogging is a good way to clarify your position in the public record
-A recent study found that blogs are remarkably rare in the psychological world

What motivated this specific post occurred this past weekend, when I vented a little bit on social media, which received an unexpected degree of response from parts of the online scholarly community.

It started on Saturday evening. I'll leave out the details, but let's just say that a group of young scholars were consistently shellacking the scientific articles they were reading without providing any constructive means to help rectify the problems they were pointing out (after multiple requests to do so). Growing frustrated with this, I tweeted the following:

I feel strongly about this issue as pointing out flaws in research can be done by anyone, even without proper scientific training. We should expect more out of trained scientists (and scientists in training), and ask that they go beyond pulling down the pillars of a given theory or study, and challenge them to propose how it should've been done differently (with some big caveats that I touch on farther down).

The tweet seemed harmless and I went to bed. The next day, as I came back home from grabbing the morning coffee, I noticed that my Twitter account was receiving a shit-ton of notifications. This was largely due to the fact that Dr. Simine Vazire, an eminent psychologist, re-tweeted my statement with the following:

Dr. Vazire's tweet was respectful and collegial and over 100 people (I feel like that's a lot) showed support for her disagreement.

I have immense respect for Dr. Vazire and was surprised that anyone noticed my tweet at all, let alone such an eminent scholar. It took about a second of reflection to realize that my tweet was likely interpreted as an attack on the many recent instances in which researchers had failed to replicate other scientists' effects or found evidence that certain theories were not well supported by the evidence. I didn't have this in mind when I sent my original tweet, but I can see how it was interpreted that way.

When I scrolled through the list of other scholars who virtually voiced their support for Dr. Vazire's disagreement, I found myself able to 'check off' certain prominent researchers who I knew played central roles in psychology's methodological reformation movement (an imperfect name, but there it is). I found myself missing just one person, who then quickly appeared via retweet:

I got the sense that folks were 'rallying around the flag'. Which was confusing as I never meant to attack anything.

The remarks were largely respectful and focused on the ideas, but quickly it became clear that I had been labeled by some of the crowd as an opponent to replication, scientific criticism, and methodological reform. Tweets like this appeared:

If you're looking to paint me as an opponent to methodological reform and the value of replication and critique, I'm not your enemy. I'm committed to replicable, open, and rigorous science, and I can back that up. Here are some examples:
-I'm currently participating in a multi-lab registered replication project that will seek to publish results regardless of whether the replication attempt is "successful"
-I teach the value of replication and methodological reform to my graduate and undergraduate psychology courses
-I recently published my first paper with open data and code
-Studies from my (1.5-year-old) lab are required to be preregistered (example 1, example 2)
-Studies from my lab are required, whether the hypotheses are supported or not, to have their data, code, and materials publicly shared (example)

Regarding my tweet, I'd like to walk it back and say that "tearing down" others' work isn't "intellectually unimpressive", but I do still believe that it's less impressive than also describing how the targeted work could have been improved. However, I do also agree with the tweet below, that sometimes, there isn't any building up to do:

And just to be clear: In cases such as fraud, data fabrication, and statistical errors, the response should be to immediately tear-down without any concern for building up.

I'll close with a reflection. I know many people in academia and beyond who avoid social media like the plague because of experiences just like this. They fear that an errant tweet or Facebook comment will land them in the critical sights of their peers, much like I did. It's very scary to know that others, who have never met you, are judging you and your ideas, based off a small amount of data (N = 280 characters). Several individuals used a single tweet to determine I was a particular type of person and they treated me accordingly. Scholarly debates often involve personal inferences based on small samples (e.g., 1 tweet), which leads to a lot of noise being interpreted as signal.

I worry about the voices we are missing out on because of these fears.

As for the rest of it . . . ?


David Chester's Blog