Suppose we were to run an uncontrolled pre-post evaluation of an intervention to alleviate psychological distress. We screen participants for distress and invite those with scores 1.5 SDs or more above the mean to take part. Then, following the intervention, we collect data on distress again to see if it has reduced. The measure we have chosen has a test-retest reliability of 0.8.
Here is a picture of simulated findings (scores have been scaled so that they have a mean of 0 and SD of 1). Red points denote data from people who have been included in the study.
I have set up the simulation so that the intervention had no effect, in the sense that outcomes would have been identical in the absence of the intervention. However, looking at the right-hand side, it appears that there has been a reduction in distress of 1.1 SDs – a huge effect. This is highly “statistically significant”, p < .001. What happened?!
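To make the setup concrete, here is a minimal sketch of how data like these might be simulated, assuming pre and post scores are bivariate standard normal with correlation equal to the test-retest reliability. The `simulate_study` helper is hypothetical – it is not the code behind the app mentioned below, so it won’t reproduce the exact figures above.

```python
# Minimal sketch, not the actual app code: pre/post scores are bivariate
# standard normal with correlation = test-retest reliability, and the
# intervention has no effect by construction.
import numpy as np
from scipy import stats

def simulate_study(n=100_000, reliability=0.8, cutoff=1.5, seed=42):
    rng = np.random.default_rng(seed)
    cov = [[1.0, reliability], [reliability, 1.0]]
    pre, post = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    keep = pre >= cutoff  # screen: invite only high-distress scorers
    return pre[keep], post[keep]

pre, post = simulate_study()
improvement = pre - post  # positive = apparent drop in distress
t, p = stats.ttest_rel(pre, post)
print(f"mean apparent improvement: {improvement.mean():.2f} SDs, p = {p:.2g}")
```

Even though the simulated intervention does nothing, the screened subgroup shows a sizeable, highly “significant” mean improvement.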
Tweaking the simulation
Let’s try a different simulation. This time there is no screening, so everyone receives the intervention regardless of their level of distress (so all the data points are red):
Looking at the right-hand side, the pre-post change is 0 and p is close to 1. There is no change.
Next, select participants whose scores are at the mean or above:
The pre-post change is now statistically significant again, with an improvement of 0.27 SDs.
Select participants with more extreme scores, 1.5 SDs or above at baseline, and we see the magnitude of change has increased again:
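Reusing the hypothetical `simulate_study` helper from the sketch above, we can run all three selection rules side by side:

```python
# Sweep the baseline cutoff: no selection, mean-or-above, 1.5 SDs or above.
for cutoff in [-np.inf, 0.0, 1.5]:
    pre, post = simulate_study(cutoff=cutoff)
    print(f"cutoff {cutoff:>5}: mean apparent improvement "
          f"{(pre - post).mean():.2f} SDs")
```

The apparent improvement grows as the selection rule becomes more extreme.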
What happens if we increase the test-retest reliability of the measure to 0.9?
First, the scatterplot on the left is a little less fuzzy. Second, the magnitude of change has dropped to 0.48 SDs.
Finally, let’s make the measure perfectly reliable so that the scatterplot on the left is a fuzz-free straight line:
Now there is no change.
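The same hypothetical helper can hold the 1.5 SD cutoff fixed and sweep reliability instead; the apparent improvement shrinks towards zero as reliability approaches 1:

```python
# Hold the 1.5 SD cutoff fixed and vary test-retest reliability.
for reliability in [0.8, 0.9, 1.0]:
    pre, post = simulate_study(reliability=reliability)
    print(f"reliability {reliability}: mean apparent improvement "
          f"{(pre - post).mean():.2f} SDs")
```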
What’s going on?
I have simulated the data so that the intervention had zero impact on outcomes, and yet for many of the analyses above it does appear to have alleviated distress.
The extent of the effect illustrated above, called regression to the mean, depends partly on how selective we are in inviting participants to join the study. At one extreme, if there is no selection, then the mean change remains zero. At the other extreme, when we are highly selective, the change is over 1 SD.
This is because, by selecting people with particularly high scores at baseline, there’s an increased chance that we include people who had, for them, a statistically rare score. Perhaps they had a particularly bad day, which wasn’t indicative of their general levels of distress. Since we selected them when they happened to have a bad day, on measuring again after the intervention, there was a good chance they had a much less extreme score. But this reduction was entirely unrelated to the intervention. We know this because the simulation was set up so that the intervention had zero effect.
Making test-retest reliability perfect also eliminates regression to the mean. However, perfect reliability is unlikely to be achievable for most of the characteristics of people that interventions target.
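Both observations fall out of a little algebra, assuming the bivariate normal model from the sketch above (which may not match the app’s exact data-generating process). If pre and post scores are standard normal with correlation $\rho$, the test-retest reliability, and we select people scoring at or above a cutoff $c$, then

$$
\mathbb{E}[\text{post} \mid \text{pre} = x] = \rho x,
\qquad
\mathbb{E}[\text{pre} - \text{post} \mid \text{pre} \ge c] = (1 - \rho)\,\frac{\varphi(c)}{1 - \Phi(c)},
$$

where $\varphi$ and $\Phi$ are the standard normal density and distribution function. The spurious improvement is zero with no selection ($c = -\infty$) or perfect reliability ($\rho = 1$), and it grows as the cutoff rises or the reliability falls.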
You can play around with the app I developed to simulate the data over here.
Regression to the mean is just one reason why interventions can spuriously appear to have an effect. Carefully chosen control groups, with random assignment to intervention or control where possible, can account for alternative explanations of change.