āCounterfactualā is not a synonym for ācontrol groupā. In fact, the treatment groupās actual outcomes are used when estimating the control groupās counterfactual outcomes, which is necessary to estimate the average treatment effect on control (ATC) or average treatment effect (ATE) estimands.

An individual’s treatment effect is defined as a within-person difference between the potential outcome following treatment and the potential outcome following control. This individual treatment effect is impossible to measure, since only one potential outcome is realised depending on which group the individual was in. However, various averages of the treatment effects can be estimated.

For ATC, we are interested in estimating averages of these treatment effects for control group participants. We know the control group’s actual outcomes. We also need to answer the counterfactual query:

If individuals in the control group had been assigned treatment, what would their average outcome have been?

To estimate ATC using matching, we need to find a treatment group match for each individual in the control group. Those treatment group matches are used to estimate the control group’s counterfactual outcomes.

For ATE, we are interested in estimating averages of these treatment effects for all participants. This means we need a combination of answers to the following counterfactual queries:

(a) If individuals in the treatment group had been assigned control, what would their average outcome have been?

(b) If individuals in the control group had been assigned treatment, what would their average outcome have been?

To estimate ATE using matching, each treatment individual needs a control group match and each control group individual needs a treatment group match. So, for ATE, both treatment and control groups could be considered counterfactuals, in the sense that they are both used to estimate the other group’s counterfactual outcomes. However, I think it is clearer if we draw a distinction between group (treatment or control) and what we are trying to estimate using data from a group (actual or counterfactual outcomes).

Process tracing isĀ an application of Bayes’ theorem to test hypotheses using qualitative evidence.Ā¹ Application areas tend to be complex, e.g., evaluating the outcomes of international aid or determining the causes of a war by interpreting testimony and documents. This post explores what happens if we apply process tracing to a simple hypothetical quantitative study: an RCT that includes a mediation analysis.

Process tracing is often conducted without probabilities, using heuristics such as the “hoop test” or “smoking gun test” that make its Bayesian foundations digestible. Alternatively, probabilities may be made easier to digest by viewing them through verbal descriptors such as those provided by the PHIA Probability Yardstick. Given the simple example we will tackle, I will apply Bayes’ rule directly to point probabilities.

I will assume that there are three mutually exclusive hypotheses:

Null: the intervention has no effect.

Out: the intervention improves outcomes; however, not through the hypothesised mediator (it works but we have no idea how).

Med: the intervention improves the outcome and it does so through the hypothesised mediator.

Other hypotheses I might have included are that the intervention causes harm or that the mediator operates in the opposite direction to that hypothesised. We might also be interested in whether the intervention pushes the mediator in the desired direction without shifting the outcome. But let’s not overcomplicate things.

There are two sources of evidence, estimates of:

Average treatment effect (ATE): I will treat this evidence source as binary: whether there is a statistically significant difference between treat and control or not (alternative versus null hypothesis). Let’s suppose that the Type I error rate is 5% and power is 80%. ThisĀ means that if either Out or Med holds, then there is an 80% chance of obtaining a statistically significant effect. If neither holds, then there is a 5% chance of obtaining a statistically significant effect (in error).

Average causal mediation effect (ACME): I will again treat this as binary: is ACME statistically significantly different to zero or not (alternative versus null hypothesis). I will assume that if ATE is significant and Med holds, then there is a 70% chance that ACME will be significant. Otherwise, I will assume a 5% chance (by Type I error).

Note where I obtained the probabilities above. I got the 5% and 80% for free, following conventions for Type I error and power in the social sciences. I arrived at the 70% using finger-in-the-wind: it should be possible to choose a decent mediator based on the prior literature, I reasoned; however, I have seen examples where a reasonable choice of mediator still fails to operate as expected in a highly powered study.

Finally, I need to choose prior probabilities for Null, Out, and Med. Under clinical equipoise, I feel that there should be a 50-50 chance of the intervention having an effect or not (findings from prior studies of the same intervention notwithstanding). Now suppose it does have an effect. I am going to assume there is a 50% chance of that effect operating through the mediator.

This means that

P(Null) = 50%
P(Out) = 25%
P(Med) = 25%

So, P(Out or Med) = 50%, i.e., the prior probabilities are setup to reflect my belief that there is a 50% chance the intervention works somehow.

I’m going to use a Bayesian network to do the sums for me (I used GeNIe Modeler). Here’s the setup:

The lefthand node shows the prior probabilities, as chosen. The righthand nodes show the inferred probabilities of observing the different patterns of evidence.

Let’s now pretend we have concluded the study and observed evidence. Firstly, we are delighted to discover that there is a statistically significant effect of the intervention on outcomes. Let’s update our Bayesian network (note how the Alternative outcome on ATE has been underlined and emboldened):

P(Null) has now dropped to 6% and P(ACME > 0) has risen to 36%. We do not yet have sufficient evidence to distinguish between Out or Med: their probabilities are both 47%.Ā²

Next, let’s run the mediation analysis. It is also statistically significant:

So, given our initial probability assignments and the pretend evidence observed, we can be 93% sure that the intervention works and does so through the mediator.

If the mediation test had not been statistically significant, then P(Out) would have risen to 69% and P(Med) would have dropped to 22%. If the ATE had been indistinguishable from zero, then P(Null) would have been 83%.

Ā¹ Okay, I accept that it is controversial to say that process tracing is necessarily an application of Bayes, particularly when no sums are involved. However, to me Bayes’ rule explains in the simplest possible terms why the four tests attributed to Van Evera (1997) [Guide to Methods for Students of Political Science. New York, NY: Cornell University Press.] work. It’s clear why there are so many references to Bayes in the process tracing literature.

Ā² These are all actually conditional probabilities. I have made this implicit in the notation for ease of reading. Hopefully all is clear given the prose.

For example, P(Hyp = Med | ATE = Alternative) = Ā 47%; in other words, the probability of Med given a statistically significant ATE estimate is 47%.

Stata has had kernel matching for years in psmatch2 and kmatch. I couldnāt find any obvious R packages. I had a play to try to get my head around what Stata’s packages are doing. Use at your own risk.

“Matching with replacement induces two types of correlations that must be accounted for when estimating the variance of estimated treatment effects. The first is a within-matched set correlation in outcomes. Matched subjects within the same matched set have similar values of the propensity score. Subjects who have the same value of the propensity score have measured baseline covariates that come from the same multivariate distribution. In the presence of confounding, baseline covariates are related to the outcome. Thus, matched subjects are more likely to have similar outcomes compared to two randomly selected subjects. The second source of correlation is induced by repeated use of control subjects. Failure to account for this correlation and acting as though the matched control subjects were independent observations will likely result in estimated standard errors that are artificially small and estimated confidence intervals that are artificially narrow. Added complexity is introduced by having subjects cross-classified with matched sets such that the same control subject can belong to more than one matched set.”

Suppose we were to run an uncontrolled pre-post evaluation of an intervention to alleviate psychological distress. We screen participants for distress and invite those with scores 1.5 SDs or more above the mean to take part. Then, following the intervention, we collect data on distress again to see if it has reduced. The measure we have chosen has a test-retest reliability of 0.8.

Here is a picture of simulated findings (scores have been scaled so that they have a mean of 0 and SD of 1). Red points denote data from people who have been included the study.

I have setup the simulation so that the intervention had no effect, in the sense that outcomes would have been identical in the absence of the intervention. However, looking at the right hand side, it appears that there has been a reduction in distress of 1.1 SDs – a huge effect. This is highly “statistically significant”, p < .001. What happened?!

Tweaking the simulation

Let’s try a different simulation. This time, without any screening, so everyone is included in the intervention regardless of their levels of distress (so all the data points are red):

Looking at the right hand side, the pre-post change is 0 and p is close to 1. There is no change.

Next, select participants whose scores are at the mean or above:

The pre-post change is now statistically significant again, with improvement of 0.27 SDs.

Select participants with more extreme scores, 1.5 SDs or above at baseline, and we see the magnitude of change has increased again:

What happens if we increase the test-retest reliability of the measure to 0.9?

Firstly, the scatterplot on the left is a little less fuzzy. The magnitude of change has reduced to 0.48 SDs.

Finally, let’s make the measure perfectly reliable so that the scatterplot on the left is a fuzz-free straight line:

Now there is no change.

What’s going on?

I have simulated the data so that the intervention had zero impact on outcomes, and yet for many of the analyses above it does appear to have alleviated distress.

The extent to which the effect illustrated above, called regression to the mean, occurs partly depends on how selective we are in inviting participants to join the study. At one extreme, if there is no selection, then the mean change is still zero. At the other extreme, when we are highly selective, then change is over 1 SD.

This is because by selecting people with particularly high scores at baseline, there’s an increased chance that we include people who had, for them, a statistically rare score. Perhaps they had a particularly bad day, which wasn’t indicative of their general levels of distress. Since we selected them when they happened to have a bad day, on measuring again after the intervention, there was a good chance they had a much less extreme score. But this reduction was entirely unrelated to the intervention. We know this because the simulation was setup so that the intervention had zero effect.

Making test-retest reliability perfect also eliminates regression to the mean. However, this is unlikely to be possible for most of the characteristics of people that are of interest for interventions.

You can play around with the app I developed to simulate the data over here.

Regression to the mean is just one reason why interventions can spuriously appear to have an effect. Carefully chosen control groups, where possible with random assignment to intervention or control, can take account of alternative explanations of change.

A reason to compute adjusted predictions (or estimated marginal means) is to help understanding the relationship between predictors and outcome of a regression model. In particular for more complex models, for example, complex interaction terms, it is often easier to understand the associations when looking at adjusted predictions instead of the raw table of regression coefficients.

The next step, which often follows this, is to see if there are statistically significant differences. These could be, for example, differences between groups, i.e.Ā between the levels of categorical predictors or whether trends differ significantly from each other.

TheĀ ggeffectsĀ package provides a function,Ā hypothesis_test(), which does exactly this: testing differences of adjusted predictions for statistical significance. This is usually calledĀ contrastsĀ orĀ (pairwise) comparisons. This vignette shows some examples how to use theĀ hypothesis_test() function and how to test whether differences in predictions are statistically significant.

Breznau, et al. (2022) asked a group of 161 researchers in 73 teams to analyse the same dataset and test the same hypothesis: greater immigration reduces public support for the welfare state. As we now expect in this genre of the literature, results varied. See the study’s figure below:

So roughly 60% of analyses found a non-statistically significant result. Of the 40% that were statistically significant, 60% found a negative association and 40% found a positive association.

Social scientists are well-versed in the replication crisis and, e.g., the importance of preregistering analyses and not relying too heavily on the findings from any one study.

Mathur et al. (2022) offer a glimmer of hope, though. The variation looks fairly wild when focussing on whether a hypothesis test was statistically significant or not. However, 90% of analyses found that a one-unit increase in immigration was associated with an increase or decrease in public support of less than 4% of a standard deviation – tiny effects!

I also find hope in all the meta-analyses transparently showing biases. It seems that quantitative social science is the most unreliable and difficult to replicate form of social science, except for all the others.

Social policy and programme evaluations often report findings in terms of casual estimands such as the average treatment effect (ATE) or the average treatment effect on the treated (ATT or ATET). An estimand is a quantity we are trying to estimate – but what exactly does that mean? This post explains through simple examples.

Suppose a study has two conditions, treat (=1) and control (=0). Causal estimands are defined in terms of potential outcomes: the outcome if someone had been assigned to treatment, \(Y(1)\), and outcome if someone had been assigned to control, \(Y(0)\).

We only get to see one of those two realised, depending on which condition someone was actually assigned to. The other is a counterfactual outcome. Assume, for a moment, that you are omniscient and can observe both potential outcomes. The treatment effect (TE) for an individual is \(Y(1)-Y(0)\) and, since you are omniscient, you can see it for everyone.

Here is a table of potential outcomes and treatment effects for 10 fictional study participants. A higher score represents a better outcome.

Person

Condition

Y(0)

Y(1)

TE

1

1

0

7

7

2

0

3

0

-3

3

1

2

9

7

4

1

1

8

7

5

0

4

1

-3

6

1

3

10

7

7

0

4

1

-3

8

0

8

5

-3

9

0

7

4

-3

10

1

3

10

7

Note the pattern in the table. People who were assigned to treatment have a treatment effect of \(7\) and people who were assigned to control have a treatment effect of \(-3\), i.e., if they had been assigned to treatment, their outcome would have been worse. So everyone in this fictional study was lucky: they were assigned to the condition that led to the best outcome they could have had.

The average treatment effect (ATE) is simply the average of treatment effects:Ā

Alas we aren’t really omniscient, so in reality see a table like this:

Person

Condition

Y(0)

Y(1)

TE

1

1

?

7

?

2

0

3

?

?

3

1

?

9

?

4

1

?

8

?

5

0

4

?

?

6

1

?

10

?

7

0

4

?

?

8

0

8

?

?

9

0

7

?

?

10

1

?

10

?

This table highlights the fundamental problem of causal inference and why it is sometimes seen as a missing data problem.

Don’t confuse estimands and methods for estimation

One of the barriers to understanding these estimands is that we are used to taking a between-participant difference inĀ group means to estimate the average effect of a treatment. But the estmands are defined in terms of a within-participant difference between two potential outcomes, only one of which is observed.

The causal effect is a theoretical quantity defined for individual people and it cannot be directly measured.

Here is another example where the causal effect is zero for everyone, so ATT, ATE, and ATC are all zero too:

Person

Condition

Y(0)

Y(1)

TE

1

1

7

7

0

2

0

3

3

0

3

1

7

7

0

4

1

7

7

0

5

0

3

3

0

6

1

7

7

0

7

0

3

3

0

8

0

3

3

0

9

0

3

3

0

10

1

7

7

0

However, people have been assigned to treatment and control in such a way that, given the outcomes realised, it appears that treatment is better than control. Here is the table again, this time with observations we couldn’t observe removed:

Person

Condition

Y(0)

Y(1)

CE

1

1

?

7

?

2

0

3

?

?

3

1

?

7

?

4

1

?

7

?

5

0

3

?

?

6

1

?

7

?

7

0

3

?

?

8

0

3

?

?

9

0

3

?

?

10

1

?

7

?

So, if we take the average of realised treatment outcomes we get 7 and the average of realised control outcomes we get 3. The mean difference is then 4. This estimate is biased. The correct answer is zero, but we couldn’t tell from the available data.

The easiest way to estimate ATE is through a randomised controlled trial. In this kind of study, the mean difference in observed outcomes is an unbiased estimate of ATE. For other estimators that don’t require random treatment assignment and for other estimands, try Scott Cunningham’s Causal Inference: The Mixtape.

How do you choose between ATE, ATT, and ATC?

Firstly, if you are running a randomised controlled trial, you don’t choose: ATE, ATT, and ATC will be the same. This is because, on average across trials, the characteristics of those who were assigned to treatment or control will be the same.

So the distinction between these three estimands only matters for quasi-experimental studies, for example where treatment assignment is not under the control of the researcher.

Noah Greifer and Elizabeth Stuart offer a neat set of example research questions to help decide (here lightly edited to make them less medical):

ATT: should an intervention currently being offered continue to be offered or should it be withheld?

ATC: should an intervention be extended to people who don’t currently receive it?

ATE: should an intervention be offered to everyone who is eligible?

How does intention to treat fit in?

The distinction between ATE and ATT is unrelated to the distinction between intention to treat and per-protocol analyses. Intention to treat analysis means we analyse people according to the group they were assigned to, even if they didn’t comply, e.g., by not engaging with the treatment. Per-protocol analysis is a biased analysis that only analyses data from participants who did comply and is generally not recommended.

For instance, it is possible to conduct a quasi-experimental study that uses intention to treat and estimates the average treatment effect on the treated. In this case, ATT might be better called something like average treatment effect for those we intended to treat (ATETWITT). Sadly this term hasn’t yet been used in the literature.

Summary

Causal effects are defined in terms of potential outcomes following treatment and following control. Only one potential outcome is observed, depending on whether someone was assigned to treatment or control, so causal effects cannot be directly observed. The fields of statistics and causal inference find ways to estimate these estimands using observable data. The easiest way to estimate ATE is through a randomised controlled trial. In this kind of study, the mean difference in observed outcomes is an unbiased estimate of ATE. Quasi-experimental designs allow the estimation of additional estimands: ATT and ATC.

“The problems of gender and racial bias in our information systems are complex, but some of their key causes are plain as day […]. When data teams are primarily composed of people from dominant groups, those perspectives come to exert outsized influence on the decisions being madeāto the exclusion of other identities and perspectives. This is not usually intentional; it comes from the ignorance of being on top. We describe this deficiency as a privilege hazard.”

– Catherine D’Ignazio and Lauren F. Klein (2020). Data feminism. MIT Press.