The Turing Way handbook to reproducible, ethical and collaborative data science

“The Turing Way project is open source, open collaboration, and community-driven. We involve and support a diverse community of contributors to make data science accessible, comprehensible and effective for everyone. Our goal is to provide all the information that researchers and data scientists in academia, industry and the public sector need to ensure that the projects they work on are easy to reproduce and reuse.”


Let’s not replace “impact evaluation” with “contribution analysis”

Giel Ton has written an interesting blog post arguing that we should shift from talking about “impact evaluation” to “contribution analysis”, in the form devised by Mayne. Ton defines contribution analysis as following this process:

“make a good theory of change, identify key assumptions in this theory of change, and focus M&E and research on these key assumptions.”

My first thought was: this definition is remarkably broad! It’s the same as for any theory-based approach (or theory-driven – evaluation is awash with synonyms) where you start with a theory of change (ToC) and test and refine it. See, e.g., what Fitz-Gibbon and Morris (1975), Chen and Rossi (1980), and many others were proposing before Mayne. They all criticise “black box” approaches that lob methods at a programme without stopping to think about what it might do and how, so I wondered what makes Ton’s (and/or Mayne’s) proposal different to these broad umbrella approaches that include all methods – mixed, blended, interwoven, shaken, or stirred – so long as a ToC is used throughout.

One recurring issue is people endlessly rocking up with yet another panacea: “Behold! ACME Programme™ will finally sort out your social problem!” Effect size, 0.1 SDs, if you’re lucky. A piece by Thomas Delahais (2023) helped clarify for me what’s different about the contribution analysis approach and how it helps address the panacea phenomenon: alternative explanations of change are treated as being just as important as the new programme being investigated. That’s a fun challenge for all evaluation approaches – qual, quant and RCTs included. For instance, we would design statistical analyses to tell us something about mechanisms that are involved in a range of activities in and around a new programme. We would explore how a new programme interacts with existing activities. These ideas sound very sensible to me – and are often done through implementation and process evaluation. But taking the broader context and alternative explanations of change seriously goes well beyond contribution analysis. We might call the activity something like “evaluation”.


Chen, H.-T., & Rossi, P. H. (1980). The Multi-Goal, Theory-Driven Approach to Evaluation: A Model Linking Basic and Applied Social Science. Social Forces, 59, 106–122.

Delahais, T. (2023). Contribution Analysis. LIEPP Methods Brief, 44.

Fitz-Gibbon, C. T., & Morris, L. L. (1975). Theory-based evaluation. Evaluation Comment, 5(1), 1–4. Reprinted in Fitz-Gibbon, C. T., & Morris, L. L. (1996). Theory-based evaluation. Evaluation Practice, 17(2), 177–184.

Hedges’ g for multilevel models in R {lmeInfo}

This package looks useful (for {nlme} not {lme4}).

“Provides analytic derivatives and information matrices for fitted linear mixed effects (lme) models and generalized least squares (gls) models estimated using lme() (from package ‘nlme’) and gls() (from package ‘nlme’), respectively. The package includes functions for estimating the sampling variance-covariance of variance component parameters using the inverse Fisher information. The variance components include the parameters of the random effects structure (for lme models), the variance structure, and the correlation structure. The expected and average forms of the Fisher information matrix are used in the calculations, and models estimated by full maximum likelihood or restricted maximum likelihood are supported. The package also includes a function for estimating standardized mean difference effect sizes (Pustejovsky, Hedges, and Shadish (2014) <doi:10.3102/1076998614547577>) based on fitted lme or gls models.”

How Many Imputations Do You Need? {howManyImputations}

“When performing multiple imputations, while 5-10 imputations are sufficient for obtaining point estimates, a larger number of imputations are needed for proper standard error estimates. This package allows you to calculate how many imputations are needed, following the work of von Hippel (2020).”

Useful example here.


Von Hippel, P. T. (2020). How many imputations do you need? A two-stage calculation using a quadratic rule. Sociological Methods & Research, 49(3), 699-718.

Why does everyone love a good RCT?

The individual treatment effect is defined as an individual’s potential outcome under treatment minus their potential outcome under control. This within-participant difference cannot be directly measured since only one of the two potential outcomes is realised depending on whether the participant was exposed to treatment or control.

Everyone loves a good randomised controlled trial because the mean outcome of people who were exposed to treatment minus the mean outcome of people who were exposed to control – a between-participant difference – is an unbiased estimator of the mean of within-participant individual treatment effects.

I’ve coded up a simulation in R over here to illustrate how they work. Note in particular the importance of confidence intervals!
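The linked simulation is in R; as a rough, self-contained sketch of the same idea (made-up parameters, written in Python for brevity), one can simulate both potential outcomes for every participant and check that the between-group difference in means centres on the true average of the individual effects:

```python
import random
import statistics

random.seed(1)
n = 1000  # participants

# Potential outcomes for every participant: y0 under control, y1 under treatment.
y0 = [random.gauss(10, 2) for _ in range(n)]
y1 = [y + random.gauss(1.5, 1) for y in y0]  # heterogeneous individual effects

# The estimand: the mean of the within-participant individual treatment effects.
true_ate = statistics.mean(t - c for t, c in zip(y1, y0))

# Replicate the trial many times: randomise, observe only one potential
# outcome per participant, and take the between-group difference in means.
estimates = []
for _ in range(500):
    treated = set(random.sample(range(n), n // 2))
    treat_mean = statistics.mean(y1[i] for i in treated)
    control_mean = statistics.mean(y0[i] for i in range(n) if i not in treated)
    estimates.append(treat_mean - control_mean)

# Unbiasedness: the estimates centre on the true average treatment effect.
print(f"true ATE = {true_ate:.2f}, mean estimate = {statistics.mean(estimates):.2f}")
```

Any single replication’s estimate bounces around the true value; that sampling variability is exactly what the confidence interval around a trial’s point estimate is there to express.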

Causal Models and Metaphysics – two interesting papers by Jenn McDonald

These look fun:

Causal Models and Metaphysics – Part 1: Using Causal Models

“This paper provides a general introduction to the use of causal models in the metaphysics of causation, specifically structural equation models and directed acyclic graphs. It reviews the formal framework, lays out a method of interpretation capable of representing different underlying metaphysical relations, and describes the use of these models in analyzing causation.”

Causal Models and Metaphysics – Part 2: Interpreting Causal Models

“This paper addresses the question of what constitutes an apt interpreted model for the purpose of analyzing causation. I first collect universally adopted aptness principles into a basic account, flagging open questions and choice points along the way. I then explore various additional aptness principles that have been proposed in the literature but have not been widely adopted, the motivations behind their proposals, and the concerns with each that stand in the way of universal adoption. I conclude that the remaining work of articulating aptness for a SEM analysis of causation is tied up with issues to do with modality, ontology, and mereology. Continuing this work is therefore likely to shed light on the relationship between these areas and causation more generally.”



On the parallel trends assumption in difference-in-differences (diff-in-diffs)

“The man who has fed the chicken every day throughout its life at last wrings its neck instead, showing that more refined views as to the uniformity of nature would have been useful to the chicken.”
        – Bertrand Russell (1912/2001, p. 35)

The parallel trends assumption of difference-in-differences (diff-in-diffs) is that the average outcomes for intervention and comparison groups would have continued in parallel from pre- to post-intervention if the intervention had not been introduced. This assumption cannot be directly tested, since when diff-in-diffs is used, the intervention is introduced. However, a case is often made that parallel trends probably holds (or doesn’t not hold) by analysing pre-intervention trends.

The mystery graph below shows an example from Kahn-Lang and Lang (2020, p. 618), redrawn to add some suspense:

The averages for the two groups (A and B) are practically identical and remain parallel. I can also reveal that there is a large number of observations – enough to be really confident that the lines are parallel. Given data like this, many of us would be confident that we had found no evidence against parallel trends.

Alas, once the time series is extended, we see that the averages markedly diverge. Adding titles to the graph reveals why – it shows median height by gender from ages 5 to 19:

Growth reference data from WHO; see the percentiles xlsx files for girls and boys over here

Around age 9, the median girls’ height begins to exceed boys’, the difference peaking at about 12 years old. Then the difference in medians decreases until around 13, when boys’ median height begins to exceed girls’.

Clearly, if we wanted to evaluate, e.g., an intervention to boost children’s height, we wouldn’t compare the mean height of one gender with another as control. The biological processes underpinning gender differences in pubertal growth spurt are well-known. However, diff-in-diffs is often applied in situations where much less is known about the dynamics of change over time.

As this example illustrates, the more we know about the topic under investigation and the more covariates we have at our disposal for choosing comparison units, the better our causal estimates using diff-in-diffs are likely to be. Diff-in-diffs can also be combined with matching or weighting on covariates to help construct a comparison group such that parallel trends is more likely to hold; see, e.g., Huntington-Klein (2022, Section 18.3.2).
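The estimator itself is simple arithmetic; here is a toy illustration (hypothetical group means, not real data) showing where the parallel trends assumption does all the work:

```python
# Hypothetical group means (made-up numbers, purely for illustration).
pre_treat, post_treat = 20.0, 27.0  # intervention group, before and after
pre_comp, post_comp = 18.0, 22.0    # comparison group, before and after

# The comparison group's change stands in for the counterfactual trend --
# what would have happened to the intervention group without the intervention.
counterfactual_change = post_comp - pre_comp  # 4.0

# Diff-in-diffs: the intervention group's change minus that trend.
did = (post_treat - pre_treat) - counterfactual_change
print(did)  # 3.0
```

If parallel trends fails – as in the height example above, where the gender gap widens and narrows on its own – the subtracted trend is wrong and the estimate is biased, no matter how clean the arithmetic.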


Huntington-Klein, N. (2022). The effect: An introduction to research design and causality. CRC Press.

Kahn-Lang, A., & Lang, K. (2020). The Promise and Pitfalls of Differences-in-Differences: Reflections on 16 and Pregnant and Other Applications. Journal of Business & Economic Statistics, 38(3), 613–620.

Russell, B. (1912/2001). The problems of philosophy. Oxford University Press.


From Andrew Scott and Olivia Colman See You Taking Creepshots in Interview Magazine (April 4, 2024):

OLIVIA COLMAN: I’ve just done a week of press and Jessie [Buckley] and I were going a bit stir-crazy with the same questions. My lovely PR said, “Should we turn it into a drinking game?” So she gave all the journalists mimosas and said, “If anybody asks the same question, you have to drink.” Then the whole thing was fun.

ANDREW SCOTT: I bet you were hammered.

COLMAN: Yeah. It was brilliant.

SCOTT: And I bet the question that you drank the most to was, “What did you think when you first read the script?”

COLMAN: Oh my god. That’s it. Or, “What drew you to this?”

SCOTT: Yeah, me and Paul [Mescal] had it just recently on All of Us Strangers and that was our big one. But it’s amazing when somebody asks you a really offbeat question. You’re immediately energized by it, because the thing is, you’re never going to say, “Well, I was the seventh choice for this part, and I did this because I’m in a huge amount of debt.”

COLMAN: I have occasionally said that. “I had a huge tax bill, so it came at the right time.”

SCOTT: [Laughs] But it’s a very weird thing to try and speak about how absolutely feral you feel sometimes in that interview room. Do you know what the other one is? “Any funny stories?”


S-values are a neat idea for helping to think about – maybe even feel – the meaning of p-values. They are described by Rafi and Greenland (2020). This post explains the basic idea, beginning with flips of a coin.

  • Suppose you flip an apparently fair coin and the outcome is heads. How surprised would you feel?
  • Now flip it twice: both outcomes are heads. How surprised would you feel now?
  • Flip the coin 40 times. All of the outcomes are heads. How surprised are you now?

I suspect your level of surprise has gone from something like “meh” through to “surely this isn’t a fair coin?!”

S-values (the S is for surprisal) provide a way to think about p-values in terms of how likely it would be to get a sequence of all heads from a number of flips of a fair coin. That number of flips is the s-value and can be calculated from the p-value.

Here is an example of a coin flipped three times. There are \(2^3 = 8\) possible outcomes, listed in the table below:

First coin flip Second coin flip Third coin flip
H H H
H H T
H T H
H T T
T H H
T H T
T T H
T T T

If the coin is fair, then the probability of each of these outcomes is \(\frac{1}{8}\). In particular, the probability of all heads is also \(\frac{1}{8}\), or 0.125.

More generally, the probability of getting all heads from \(n\) fair coin flips is \(\frac{1}{2^n}\). Here is a table showing some examples:

Flips Probability all heads
1 0.5
2 0.25
3 0.125
4 0.0625
5 0.03125
6 0.01562
7 0.00781
8 0.00391
9 0.00195
10 0.00098

Now here is the connection with p-values. Suppose you run a statistical test and get \(p = 0.03125\); that’s the same probability as that of obtaining five heads in a row from five coin tosses. The s-value is 5.

Or suppose you merely got \(p = 0.5\). That’s the probability of obtaining heads after one flip of the coin. The s-value is 1.

The larger the s-value, the more surprised you would be if the coin were fair.

To convert p-values to s-values, we want to find an \(s\), such that

\(\displaystyle \frac{1}{2^s} = p\).

The log function (base 2) does this for us:

\(\displaystyle s = -\log_2(p)\).

What about the traditional (or notorious?) 0.05 level?

\(-\log_2(0.05) = 4.32\),

to two decimal places. So, that’s the same as getting all heads when you flip a coin 4.32 times – which isn’t entirely intuitive when expressed as coin flips. But you could think of it being a little more surprising than getting four heads in a row if you flipped a fair coin four times.
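The conversion is a one-liner; a quick Python sketch that reproduces the numbers above:

```python
import math

def s_value(p: float) -> float:
    """Base-2 surprisal of a p-value: the number of flips of a fair coin
    for which an all-heads run is exactly as improbable as p."""
    return -math.log2(p)

for p in (0.5, 0.125, 0.05, 0.03125):
    print(f"p = {p}: s = {s_value(p):.2f} coin flips")
```

As expected, p = 0.5 gives one flip, p = 0.03125 gives five, and the traditional 0.05 level gives 4.32.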


Rafi, Z., & Greenland, S. (2020). Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise. BMC Medical Research Methodology, 20, 244.