Data alone can’t determine causal structure

Suppose we find that the probability of a successful programme outcome (out) depends on treatment (treat) and mediator (med) as per the Bayes network depicted in part 1 of the figure below. Suppose also that there are no other unmeasured variables. This model defines \(P(\mathit{out} | \mathit{treat}, \mathit{med})\), \(P(\mathit{med} | \mathit{treat})\), and \(P(\mathit{treat})\). The arrows denote these probabilistic relationships.

Interpreting the arrows as causal relations, then all six models above are consistent with the conditional probabilities. Model 2 says that treatment and outcome are associated with each other because the mediator is a common cause. Model 3 says that the outcome causes treatment assignment. Model 4 says that the treatment causes mediator and outcome; however, outcome causes mediator. And so on. These six models are all members of the same Markov equivalence class (see Verma & Pearl, 1990).

We need something beyond the data and statistical assocations to distinguish between them: theory. Some of the theory might be trivial, e.g., that the outcome followed treatment and can’t have caused the treatment because we have ruled out time travel.


Verma, T., & Pearl, J. (1990). Equivalence and synthesis of causal models. Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence, 255–270.

Time for counterfactuals

I have just discovered Scriven’s stimulating (if grim) challenge to a counterfactual understanding of causation (see the debate recorded in Cook et al., 2010, p. 108):

“The classic example of this is the guy who has jumped off the top of a skyscraper and as he passes the 44th floor somebody shoots him through the head with a .357 magnum. Well, it’s clear enough that the shooter killed him but it’s clearly not true that he would not have died if the shooter hadn’t shot him; so the counterfactual condition does not apply, so it can’t be an essential part of the meaning of cause.”

I love this example because it illustrates a common form of programme effect and summarises the human condition – all in a couple of sentences! Let’s reshape it into an analogous example that extends the timeline by a couple of decades:

“A 60 year old guy chooses not to get a Covid vaccine. A few months later, he gets Covid and dies. Average male life expectancy is about 80 years.”

(I guess jumping is analogous to being born!)

By the end of the first sentence, I reason that if he had got the vaccine, he probably wouldn’t have died. By the end of the second sentence, I am reminded of the finiteness of life. So, the vaccine didn’t prevent death – similarly to an absence of a gunshot in the skyscraper example. How can we think about this using counterfactuals?

In a programme evaluation, it is common to gather data at a series of fixed time points, for instance a few weeks, months, and, if you are lucky, years after baseline. We are often happy to see improvement even if it doesn’t endure. For instance, if I take a painkiller, I don’t expect its effects to persist forevermore. If a vaccine extends life by two decades, that’s rather helpful. Programme effects are defined at each time point.

To make sense of the original example, we need to add in time. There are three key timepoints:

  1. Jumping (T0).
  2. Mid-flight after the gunshot (T1).
  3. Hitting the ground (T2).

When considering counterfactuals, the world may be different at each of these times, e.g., at T0 the main character might have decided to take the lift.

Here are counterfactuals that make time explicit:

  • If the guy hadn’t jumped at T0, then he wouldn’t have hit the ground at T2.
  • If the guy hadn’t jumped at T0, then he wouldn’t have been shot with the magnum and killed at T1.
  • If the guy had jumped, but hadn’t been shot by the magnum, he would still have been alive at T1 but not at T2.

To assign truth values or probabilities to each of these requires a model of some description, e.g., a causal Bayesian network, which formalises your understanding of the intentions and actions of the characters in the text – something like the DAG below, with conditional probabilities filled in appropriately.

So for instance, the probability of being dead at T2 given jumping at T0 is high – if you haven’t added variables about parachutes. What happens mid-flight governs T1 outcomes. Alternatively, you could just use informal intutition. Exercise to the reader: give it a go.

Using the Halpern-Pearl definitions of causality on this model (Halpern, 2016), jumping caused death at both T1 and T2. The shooting caused death at T1 but not T2. (R code here – proper explanation to be completed, but you could try this companion blog post and citation therein.)

Back then to the vaccine example, the counterfactuals rewrite to something like:

  • If the guy hadn’t been born at T0, then he wouldn’t have died at T2.
  • If the guy hadn’t been born at T0, then he couldn’t have chosen not to get a vaccine and died at T1.
  • If the guy had been born, but had decided to get the vaccine, he would still have been alive at T1 aged 60, but possibly not at T2 aged 80.


Cook, T. D., Scriven, M., Coryn, C. L. S., & Evergreen, S. D. H. (2010). Contemporary Thinking About Causation in Evaluation: A Dialogue With Tom Cook and Michael Scriven. American Journal of Evaluation, 31(1), 105–117.

Halpern, J. Y. (2016). Actual causality. The MIT press.

Actual causes: two examples using the updated Halpern-Pearl definition

Halpern (2015) provides three variants of the Halpern-Pearl definitions of actual causation. I’m trying to get my head around the formalism, which is elegant, concise, and precise, but tedious to use in practice, so I wrote an R script to do the sums. This blog post is not self-contained – you will need to read the original paper for an introduction to the model. However, it works through two examples, which may help if you’re also struggling with the paper.

The second (“updated”) definition of an actual cause asserts that \(\vec{A} = \vec{a}\) is a cause of \(\varphi\) in \((M,\vec{u})\) iff the following conditions hold:

AC1 \((M,\vec{u}) \models (\vec{A} =\vec{a}) \land \varphi\).

This says, if \(\vec{A} = \vec{a}\) is an actual cause of \(\varphi\) then they both hold in the actual world, \((M,\vec{u})\). Note, for this condition, we are just having a look at the model and not doing anything to it.

AC2 There is a partition of the endogenous variables in \(M\) into \(\vec{Z} \supseteq \vec{X}\) and \(\vec{W}\) and there are settings \(\vec{x’}\) and \(\vec{w}\) such that

(a) \((M,\vec{u}) \models [ \vec{X} \leftarrow \vec{x’}, \vec{W} \leftarrow \vec{w}] \neg \varphi\).

So, we’re trying to show that undoing the cause, i.e., setting \(\vec{X}\) to \(\vec{x’} \ne \vec{x}\), prevents the effect. We are allowed to modify \(\vec{W}\) however we want to show this, whilst leaving \(\vec{Z}-\vec{X}\) free to do whatever the model tells these variables to do.

(b) If \((M,\vec{u}) \models \vec{Z} = \vec{z^{\star}}\), for some \(\vec{z^{\star}}\), then for all \(\vec{W’} \subseteq \vec{W}\) and \(\vec{Z’} \subseteq \vec{Z}-\vec{X}\),
\((M,\vec{u}) \models [ \vec{X} \leftarrow \vec{x}, \vec{W’} \leftarrow \vec{w’}, \vec{Z’} \leftarrow \vec{z^{\star}}] \varphi\).

This says, trigger the cause (unlike AC1, we aren’t just looking to see if it holds) and check whether it leads to the effect under all subsets of \(\vec{Z}\) (as per actual world) that aren’t \(\vec{X}\) and all subsets of the modified \(\vec{W}\) that we found for AC2(a). Note how we are setting \(\vec{Z}\) for those subsets, rather than just observing it.

AC3 There is no \(\vec{A’} \subset \vec{A}\) such that \(\vec{A’} = \vec{a’}\) satisfies AC1 and AC2.

This says, there’s no superfluous stuff in \(\vec{A}\). You taking a painkiller and waving a magic wand doesn’t cause your headache to disappear, under AC3, if the painkiller works without the wand.

Example 1: an (actual) actual cause

Let’s give it a go with an overdetermined scenario (lightly edited from Halpern) that Alice and Bob both lob bricks at a glasshouse and smash the glass. Define

\(\mathit{AliceThrow} = 1\)
\(\mathit{BobThrow} = 1\)
\(\mathit{GlassBreaks} = \mathit{max}(\mathit{AliceThrow},\mathit{BobThrow})\)

So, if either Alice or Bob (or both) hit the glasshouse, then the glass breaks. Strictly speaking, I should have setup one or more exogenous variables, \(\vec{u}\), that define the context and then defined \(\mathit{AliceThrow}\) and \(\mathit{BobThrow}\) in terms of \(\vec{u}\), but it works fine to skip that step as I have here since I’m holding \(\vec{u}\) constant anyway.

Is \(\mathit{AliceThrow} = 1\) an actual cause of \(\mathit{GlassBreaks} = 1\)?

AC1 holds since \((M,\vec{u}) \models \mathit{AliceThrow} = 1 \land \mathit{GlassBreaks} = 1\). The first conjunct comes directly from one of the model equations and none of the functions change it. Spelling out the second conjunct,

\(\mathit{GlassBreaks} = \mathit{max}(\mathit{AliceThrow},\mathit{BobThrow})\)
\(= \mathit{max}(1, 1)\)
\(= 1\)

For AC2, we need to find a partition of the endogenous variables such that AC2(a) and AC2(b) hold. Try \(\vec{Z} = \{ \mathit{AliceThrow}, \mathit{GlassBreaks} \}\) and \(\vec{W}= \{ \mathit{BobThrow} \}\).

AC2(a) holds since \((M,\vec{u}) \models [ \mathit{AliceThrow} \leftarrow 0, \mathit{BobThrow} \leftarrow 0] \mathit{GlassBreaks} = 0\).

For AC2(b), we begin with \(\vec{Z} = \{ \mathit{AliceThrow}, \mathit{GlassBreaks} \}\) and the settings as per the unchanged model, so

\((M,\vec{u}) \models \mathit{AliceThrow} = 1 \land \mathit{GlassBreaks} = 1\).

We need to check that for all \(\vec{W’} \subseteq \vec{W}\) and \(\vec{Z’} \subseteq \vec{Z}-\vec{X}\),
\((M,\vec{u}) \models [ \vec{X} \leftarrow \vec{x}, \vec{W’} \leftarrow \vec{w’}, \vec{Z’} \leftarrow \vec{z^{\star}}] \varphi\).

Here are the combinations and \(\varphi \equiv \mathit{GlassBreaks} = 1\) holds for all of them:

\((M,\vec{u}) \models [ \mathit{AliceThrow} \leftarrow 1, \mathit{GlassBreaks} \leftarrow 1, \mathit{BobThrow} \leftarrow 0 ] \varphi\)
\((M,\vec{u}) \models [ \mathit{AliceThrow} \leftarrow 1, \mathit{BobThrow} \leftarrow 0 ] \varphi\)
\((M,\vec{u}) \models [ \mathit{AliceThrow} \leftarrow 1, \mathit{GlassBreaks} \leftarrow 1 ] \varphi\)
\((M,\vec{u}) \models [ \mathit{AliceThrow} \leftarrow 1 ] \varphi \)

(The third was rather trivially true; however, as far as I understand, has to be checked given the definition.)

AC3 is easy since the cause only has one variable, so there’s nothing superfluous.

Example 2: not an actual cause

Now let’s try an example that isn’t an actual cause: the glass breaking causes Alice to throw the brick. It’s obviously false; however, it wasn’t clear to me exactly where it would fail until I worked through this…

AC1 holds since in the actual world, \(\mathit{GlassBreaks} = 1\) and \(\mathit{AliceThrow} = 1\) hold.

Examining the function defintions, they don’t provide a way to link \(\mathit{AliceThrow}\) to a change in \(\mathit{GlassBreaks}\), so the only apparent way to do so is through \(\vec{W}\). Therefore, use the partition \(\vec{W} = \{\mathit{AliceThrow}\}\) and \(\vec{Z} = \{\mathit{GlassBreaks}, \mathit{BobThrow}\}\).

Now for AC2(a), we can easily get \(\mathit{AliceThrow} = 0\) as required, since we can do what we like with \(\vec{W}\). It doesn’t help when we move onto AC2(b) since we have to hold \(\mathit{AliceThrow} = 0\), which is the negation of what we want. The same is the case for the other partition including \(\mathit{AliceThrow}\) in \(\vec{W}\), i.e., \(\vec{W} = \{ \mathit{AliceThrow}, \mathit{BobThrow} \}\).

So, the broken glass does not cause Alice to throw a brick. The setup we needed to get through AC2(a) set us up to fail AC2(b).


Halpern, J. Y. (2015). A Modification of the Halpern-Pearl Definition of Causality. Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), 3022–3033.

See also this companion blog post.

What is a counterfactual?

What’s a counterfactual? Philosophers love the example, “If Oswald hadn’t killed Kennedy, someone else would have”. More generally, Y would be y had X been x in situation U = u (Judea Pearl’s, 2011, rendering).


Pearl, J. (2011). The structural theory of causation. In P. McKay Illari, F. Russo, & J. Williamson (Eds.), Causality in the Sciences (pp. 697–727). Oxford University Press.

“Counterfactual” is not a synonym for “control group”

“Counterfactual” is not a synonym for “control group”. In fact, the treatment group’s actual outcomes are used when estimating the control group’s counterfactual outcomes, which is necessary to estimate the average treatment effect on control (ATC) or average treatment effect (ATE) estimands.

An individual’s treatment effect is defined as a within-person difference between the potential outcome following treatment and the potential outcome following control. This individual treatment effect is impossible to measure, since only one potential outcome is realised depending on which group the individual was in. However, various averages of the treatment effects can be estimated.

For ATC, we are interested in estimating averages of these treatment effects for control group participants. We know the control group’s actual outcomes. We also need to answer the counterfactual query:

If individuals in the control group had been assigned treatment, what would their average outcome have been?

To estimate ATC using matching, we need to find a treatment group match for each individual in the control group. Those treatment group matches are used to estimate the control group’s counterfactual outcomes.

For ATE, we are interested in estimating averages of these treatment effects for all participants. This means we need a combination of answers to the following counterfactual queries:

(a) If individuals in the treatment group had been assigned control, what would their average outcome have been?

(b) If individuals in the control group had been assigned treatment, what would their average outcome have been?

To estimate ATE using matching, each treatment individual needs a control group match and each control group individual needs a treatment group match. So, for ATE, both treatment and control groups could be considered counterfactuals, in the sense that they are both used to estimate the other group’s counterfactual outcomes. However, I think it is clearer if we draw a distinction between group (treatment or control) and what we are trying to estimate using data from a group (actual or counterfactual outcomes).

Mediation analysis effects

Are you wondering how to extract estimates of the following estimands from linear models underlying mediation analyses…?

  • total average causal effect (ACE) / average treatment effect (ATE)
  • average direct effect (ADE)
  • average causal mediation effect (ACME)
  • proportion mediated effect

You might be interested in this example I put together, using {mediation} in R, showing the arithmetic!

Research Design in the Social Sciences

“This book introduces a new way of thinking about research designs in the social sciences. Our hope is that this approach will make it easier to develop and to share strong research designs.

“At the heart of our approach is the MIDA framework, in which a research design is characterized by four elements: a model, an inquiry, a data strategy, and an answer strategy. We have to understand each of the four on their own and also how they interrelate.”

Uses {DeclareDesign} in R.

Blair, G., Coppock, A., & Humphreys, M. (2023). Research Design in the Social Sciences: Declaration, Diagnosis, and Redesign. Princeton University Press.

What is a confounder?

‘We thus proposed that a pre-exposure covariate C be considered a confounder for the effect of A on Y if there exists a set of covariates X such that the effect of the exposure on the outcome is unconfounded conditional on (X, C) but for no proper subset of (X, C) is the effect of the exposure on the outcome unconfounded given the subset.’ (VanderWeele & Shpitser, 2013, p. 215)

VanderWeele, T. J., & Shpitser, I. (2013). On the definition of a confounder. The Annals of Statistics, 41(1), 196–220.

Estimating causal effects with optimization-based methods

Cousineau et al. (2023) compared seven optimisation-based methods for estimating causal effects, using 7700 datasets from the 2016 Atlantic Causal Inference competition. These datasets use real covariates with simulated treatment assignment and response functions, so it’s real-world-inspired data, with the advantage that the true effect (here, sample average treatment effect; SATT) is known. See the supplementary material of Dorie et al.’s (2019) paper for more info on how the sims were setup.

The methods they compared were:

Method R package Function used
Approximate residual balancing (ARB) balanceHD 1.0 residualBalance.ate
Covariate balancing propensity score (CBPS) CBPS 0.21 CBPS
Entropy balancing (EBal) ebal 0.1–6 ebalance
Genetic matching (GenMatch) Matching 4.9–9 GenMatch
Kernel balancing (KBal) kbal 0.1 kbal
Stable balancing weights (SBW) sbw 1.1.1 sbw

I’m hearing entropy balancing discussed a lot, so had my eye on this.

Bias was the estimated SATT minus true SATT (i.e., the +/- sign was kept; I’m not sure what to make of that when averaging biases from analyses of multiple datasets). The root-mean-square error (RMSE) squares the bias from each estimate first, removing the sign, before averaging and square rooting, which seems easier to interpret.

Findings below. N gives the number of datasets out of 7700 where SATT could be estimated; red where my eyebrows were raised and pink for entropy balancing and its RMSE:

    Bias   Time
Method N Mean SD RMSE Mean (sec)
kbal 7700 0.036 0.083 0.091 2521.3
balancehd 7700 0.041 0.099 0.107 2.0
sbw 4513 0.041 0.102 0.110 254.9
cbps_exact 7700 0.041 0.105 0.112 6.4
ebal 4513 0.041 0.110 0.117 0.2
cbps_over 7700 0.044 0.117 0.125 17.3
genmatch 7700 0.052 0.141 0.151 8282.4

This particular implementation of entropy balancing failed to find a solution for about 40% of the datasets! Note, however:

“All these optimization-based methods are executed using their default parameters on R 4.0.2 to demonstrate their usefulness when directly used by an applied researcher” (emphasis added).

Maybe tweaking the settings would have improved the success rate. And #NotAllAppliedResearchers 🙂

Below is a comparison with a bunch of other methods from the competition, for which findings were already available on a GitHub repo (see Dorie et al., 2019, Table 2 and 3, for more info on each method).

    Bias   95% CI
Method N Mean SD RMSE coverage (%)
bart on pscore 7700 0.001 0.014 0.014 88.4
bart tmle 7700 0.000 0.016 0.016 93.5
mbart symint 7700 0.002 0.017 0.017 90.3
bart mchains 7700 0.002 0.017 0.017 85.7
bart xval 7700 0.002 0.017 0.017 81.2
bart 7700 0.002 0.018 0.018 81.1
sl bart tmle 7689 0.003 0.029 0.029 91.5
h2o ensemble 6683 0.007 0.029 0.030 100.0
bart iptw 7700 0.002 0.032 0.032 83.1
sl tmle 7689 0.007 0.032 0.032 87.6
superlearner 7689 0.006 0.038 0.039 81.6
calcause 7694 0.003 0.043 0.043 81.7
tree strat 7700 0.022 0.047 0.052 87.4
balanceboost 7700 0.020 0.050 0.054 80.5
adj tree strat 7700 0.027 0.068 0.074 60.0
lasso cbps 7108 0.027 0.077 0.082 30.5
sl tmle joint 7698 0.010 0.101 0.102 58.9
cbps 7344 0.041 0.099 0.107 99.7
teffects psmatch 7506 0.043 0.099 0.108 47.0
linear model 7700 0.045 0.127 0.135 22.3
mhe algorithm 7700 0.045 0.127 0.135 22.8
teffects ra 7685 0.043 0.133 0.140 37.5
teffects ipwra 7634 0.044 0.161 0.166 35.3
teffects ipw 7665 0.042 0.298 0.301 39.0

I’ll leave you to read the original for commentary on this, but check out the RMSE and CI coverage. Linear model is summarised as “Linear model/ordinary least squares”. I assume covariates were just entered as main effects, which is a little unfair. The simulations included non-linearity and diagnostic checks on models, such as partial residual plots, would spot this. Still doesn’t do too badly – better than genetic matching!

Interestingly the RMSE was a tiny bit worse for entropy balancing than for Stata’s teffects psmatch, which in simulations was setup to use nearest-neighbour matching on propensity scores estimated using logistic regression (I presume the defaults – I’m an R user).

The winners were all either regression-based or what the authors called “mixed methods” – in this context meaning some genre of doubly-robust method that combined matching/weighting with regression adjustment. Bayesian additive regression trees (BART) feature towards the best end of the table. These sorts of regression-based methods don’t allow the design phase to be clearly separated from the estimation phase. For matching approaches where this separation is possible, the outcomes data can be held back from analysts until matches are found or weights estimated based only on covariates. Where the analysis also demands access to outcomes, a robust approach is needed, including a highly-specified and published statistical analysis plan and e.g., holding back some data in a training and validation phase before fitting the final model.

No info is provided on CI coverage for the seven optimisation-based methods they tested. This is why (Cousineau et al., 2023, p. 377):

“While some of these methods did provide some functions to estimate the confidence intervals (i.e., balancehd, sbw), these did not work due to the collinearity of the covariates. While it could be possible to obtain confidence intervals with bootstrapping for all methods, we did not pursue this avenue due to the computational resources that would be needed for some methods (e.g., kbal) and to the inferior results in Table 5 that did not warrant such resources.”

It would be interesting to zoom in on a smaller set of options and datasets and perhaps allow some more researcher input on how analyses are carried out.


Cousineau, M., Verter, V., Murphy, S. A., & Pineau, J. (2023). Estimating causal effects with optimization-based methods: A review and empirical comparison. European Journal of Operational Research, 304(2), 367–380.

Dorie, V., Hill, J., Shalit, U., Scott, M., & Cervone, D. (2019). Automated versus Do-It-Yourself Methods for Causal Inference: Lessons Learned from a Data Analysis Competition. Statistical Science, 34(1). 

Evaluating the arts

I love the arts, particularly bouncy, moderately cheesy dance music experienced in a club. Programme evaluation has been poking at the arts for a while now and attempting to quantify their impact on wellbeing. And I really wish evaluators would stop and think what they’re doing before rushing in with a wellbeing questionnaire. The experience I have dancing in a sweaty club is very different to the experience wandering around Tate or crying in a cinema. Programme effects are contrasts: the actual outcome of the programme versus an estimate of what the outcome would have been in the programme’s absence, e.g., following some genre of “business as usual”. Before we can measure the average causal benefits of the arts we need a sensible theory of change, and some ideas about what the counterfactual is. Maybe you can find out how I feel dancing to Free Yourself, but what’s the contrast? Dancing to something else? Staying in with a cup of tea and chocolate, reading a book about standard errors for matching with replacement? What exactly is the programme being evaluated…? (End of random thoughts.)