Dipping into How to Think Like a Realist

Ray Pawson’s new book (Pawson, 2024) is an introduction to scientific practice, with social science as a corollary, drawing on the philosophy of science and Pawson’s experience conducting applied research. The book is on my reading stack. For now, I have taken a purposive sample of a few episodes (as Pawson calls the chapters) to get a sense of what’s in store. Here are some clues.

The main objective of the book (p. xvi)

“… is revealed in the structure of the text, which journeys across physical science and clinical science, before landing squarely in social science. The coverage here represents a commitment to the ‘unity of science’ (Oppenheim and Putnam, 1958). This heroic proposition claims that there are core explanatory principles which underpin science in all its guises, and the book makes tiny, tentative steps in tracing the common realist tenets. Perforce, I am also committed to the ‘unity of social science’.”

What about those context, mechanism, outcome triads that realist evaluators use? (p. 42)

“All scientific investigation utilises explanations relating mechanisms and contexts to empirical patterns.”

And (p. 48):

“With Harré, I have characterised generative causation in physical science as the analysis of mechanisms, contexts, and regularities (MCR). In clinical research the focus, quite properly, is on mechanisms, contexts, and outcomes (MCO). In social research, it might be wise to begin with the shorthand mechanisms, contexts, and change (MCC).”

(I wonder what the implications of the book are for realist evaluation as a distinct genre – will read those bits with interest.)

On social science and people who don’t follow science (p. xviii):

“Despite the habitual use of the appellation ‘social science’, many of my colleagues would reject any claim to follow science, dismiss any interest in causality, deny any need for objectivity, and scorn the possibility of generalisation. They are beyond hope. I don’t seek to convert them. But in following their chosen paths these various tribes – constructivists, post-modernists, emancipators, critics, essayists, relativists, and so on – have found time to say why causality, objectivity, and generality are false idols. So, in defending the science in social science, their criticisms also need to be overturned.”

More on what Pawson aims to do (pp. 251-252):

“What I’ve come up with here might well be entitled The Old Rules of Sociological Method. I have attempted to extract and justify some realist principles for conducting social research on the back of a generous portfolio of existing examples. Those illustrations reach across many research domains and a broad portfolio of practical methods. But they remain a pinprick; I could have called upon a thousand others. Accordingly, there is another way of perceiving my efforts. The book is no more and no less than an attempt to codify and formalise existing practices. I have tried to capture a tradition. So, just as Monsieur Jourdain spoke prose without knowing what it was, it may well be that you, dear reader, have been thinking like a realist without knowing it!”

I had to google Jourdain. He’s the main character in a comedy by Molière, Le Bourgeois gentilhomme (The Bourgeois Gentleman). The play satirises “the pretensions of the social climber whose affectations are absurd to everyone but himself”, which is a curious reference, dear reader.

To be continued…


Pawson, R. (2024). How to Think Like a Realist: A methodology for social science. Edward Elgar Publishing Limited.

LaLonde (1986) after Nearly Four Decades: Lessons Learned (Guido Imbens, Yiqing Xu)

“We show that modern methods, when applied in contexts with significant covariate overlap, yield robust estimates for the adjusted differences between the treatment and control groups. However, this does not mean that these estimates are valid. To assess their credibility, validation exercises (such as placebo tests) are essential, whereas goodness of fit tests alone are inadequate. Our findings highlight the importance of closely examining the assignment process, carefully inspecting overlap, and conducting validation exercises when analyzing causal effects with nonexperimental data.”


Ioannidis and Psillos (2018) on mechanisms

“Mechanisms are causal pathways described in theoretical language that have certain functions; these descriptions can be enriched by offering more detailed or fine‐grained descriptions; the same mechanism can then be described at various levels using different theoretical vocabularies (e.g., cytological vs biochemical descriptions in the case of apoptosis); lastly, the descriptions of biomedically important mechanisms are often such that they contain specific causal information that can be used to make interventions for therapeutic purposes.” (Ioannidis & Psillos, 2018, pp. 1180–1)

Ioannidis, S., & Psillos, S. (2018). Mechanisms in practice: A methodological approach. Journal of Evaluation in Clinical Practice, 24(5), 1177–1183.

RCT of the ZOE nutrition app – and critical analysis

Abstract: “Large variability exists in people’s responses to foods. However, the efficacy of personalized dietary advice for health remains understudied. We compared a personalized dietary program (PDP) versus general advice (control) on cardiometabolic health using a randomized clinical trial. The PDP used food characteristics, individual postprandial glucose and triglyceride (TG) responses to foods, microbiomes and health history, to produce personalized food scores in an 18-week app-based program. The control group received standard care dietary advice (US Department of Agriculture Guidelines for Americans, 2020–2025) using online resources, check-ins, video lessons and a leaflet. Primary outcomes were serum low-density lipoprotein cholesterol and TG concentrations at baseline and at 18 weeks. Participants (n = 347), aged 41–70 years and generally representative of the average US population, were randomized to the PDP (n = 177) or control (n = 170). Intention-to-treat analysis (n = 347) between groups showed significant reduction in TGs (mean difference = −0.13 mmol l−1; log-transformed 95% confidence interval = −0.07 to −0.01, P = 0.016). Changes in low-density lipoprotein cholesterol were not significant. There were improvements in secondary outcomes, including body weight, waist circumference, HbA1c, diet quality and microbiome (beta-diversity) (P < 0.05), particularly in highly adherent PDP participants. However, blood pressure, insulin, glucose, C-peptide, apolipoprotein A1 and B, and postprandial TGs did not differ between groups. No serious intervention-related adverse events were reported. Following a personalized diet led to some improvements in cardiometabolic health compared to standard dietary advice. ClinicalTrials.gov registration: NCT05273268.”

Bermingham, K. M., Linenberg, I., Polidori, L., Asnicar, F., Arrè, A., Wolf, J., Badri, F., Bernard, H., Capdevila, J., Bulsiewicz, W. J., Gardner, C. D., Ordovas, J. M., Davies, R., Hadjigeorgiou, G., Hall, W. L., Delahanty, L. M., Valdes, A. M., Segata, N., Spector, T. D., & Berry, S. E. (2024). Effects of a personalized nutrition program on cardiometabolic health: A randomized controlled trial. Nature Medicine.

And here is a blog post that provides an in-depth critique. The key issue is the control group and how small a role the specific elements of ZOE play, as illustrated in this pic:


Let’s not replace “impact evaluation” with “contribution analysis”

Giel Ton has written an interesting blog post arguing that we should shift from talking about “impact evaluation” to “contribution analysis”, in the form devised by Mayne. Ton defines contribution analysis as following this process:

“make a good theory of change, identify key assumptions in this theory of change, and focus M&E and research on these key assumptions.”

My first thought was, this definition is remarkably broad! It’s the same as for any theory-based approach (or theory-driven – evaluation is awash with synonyms) where you start with a theory of change (ToC) and test and refine it. See, e.g., what Fitz-Gibbon and Morris (1975), Chen and Rossi (1980), and many others were proposing before Mayne. They all criticise “black box” approaches that lob methods at a programme before stopping to think what it might do and how, so I wondered what makes Ton’s (and/or Mayne’s) proposal different to these broad umbrella approaches that include all methods, mixed, blended, interwoven, shaken, or stirred – so long as a ToC is used throughout.

One recurring issue is people endlessly rocking up with yet another panacea: “Behold! ACME Programme™ will finally sort out your social problem!” Effect size, 0.1 SDs, if you’re lucky. A piece by Thomas Delahais (2023) helped clarify for me what’s different about the contribution analysis approach and how it helps address the panacea phenomenon: alternative explanations of change are treated as being as important as the new programme being investigated. That’s a fun challenge, for all evaluation approaches, qual, quant and RCTs included. For instance, we would design statistical analyses to tell us something about mechanisms that are involved in a range of activities in and around a new programme. We would explore how a new programme interacts with existing activities. These ideas sound very sensible to me – and are often done through implementation and process evaluation. But taking seriously the broader context and alternative explanations of change is much broader than contribution analysis. We might call the activity something like “evaluation”.


Chen, H.-T., & Rossi, P. H. (1980). The Multi-Goal, Theory-Driven Approach to Evaluation: A Model Linking Basic and Applied Social Science. Social Forces, 59, 106–122.

Delahais, T. (2023). Contribution Analysis. LIEPP Methods Brief, 44.

Fitz-Gibbon, C. T., & Morris, L. L. (1975). Theory-based evaluation. Evaluation Comment, 5(1), 1–4. Reprinted in Fitz-Gibbon, C. T., & Morris, L. L. (1996). Theory-based evaluation. Evaluation Practice, 17(2), 177–184.

Hedges’ g for multilevel models in R {lmeInfo}

This package looks useful (for {nlme} not {lme4}).

“Provides analytic derivatives and information matrices for fitted linear mixed effects (lme) models and generalized least squares (gls) models estimated using lme() (from package ‘nlme’) and gls() (from package ‘nlme’), respectively. The package includes functions for estimating the sampling variance-covariance of variance component parameters using the inverse Fisher information. The variance components include the parameters of the random effects structure (for lme models), the variance structure, and the correlation structure. The expected and average forms of the Fisher information matrix are used in the calculations, and models estimated by full maximum likelihood or restricted maximum likelihood are supported. The package also includes a function for estimating standardized mean difference effect sizes (Pustejovsky, Hedges, and Shadish (2014) <doi:10.3102/1076998614547577>) based on fitted lme or gls models.”
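
For orientation only, here is a minimal Python sketch of Hedges’ g in the simplest case: two independent groups and no multilevel structure. It is not what {lmeInfo} computes for fitted lme or gls models; the function name, the toy data, and the use of the common approximation to the small-sample correction, 1 - 3 / (4 df - 1), are my assumptions for illustration.

```python
import math
import statistics


def hedges_g(group_a, group_b):
    """Return (Cohen's d, Hedges' g) for two independent groups.

    Hypothetical helper for illustration: simple two-group case only,
    using the usual approximate small-sample correction factor.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled variance: within-group variances weighted by their degrees of freedom
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / df
    d = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(pooled_var)
    # The correction shrinks d slightly; it matters most when samples are small
    correction = 1 - 3 / (4 * df - 1)
    return d, correction * d


# Made-up outcome scores for two small groups
treated = [12, 14, 15, 17, 18]
control = [10, 11, 13, 14, 16]

d, g = hedges_g(treated, control)
print(f"Cohen's d = {d:.3f}, Hedges' g = {g:.3f}")
```

With groups this small the correction is visible; with large samples d and g are practically the same.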

Why does everyone love a good RCT?

The individual treatment effect is defined as an individual’s potential outcome under treatment minus their potential outcome under control. This within-participant difference cannot be directly measured since only one of the two potential outcomes is realised depending on whether the participant was exposed to treatment or control.

Everyone loves a good randomised controlled trial because the mean outcome of people who were exposed to treatment minus the mean outcome of people who were exposed to control – a between-participant difference – is an unbiased estimator of the mean of within-participant individual treatment effects.

I’ve coded up a simulation in R over here to illustrate how they work. Note in particular the importance of confidence intervals!
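
Separately from that R simulation, here is a minimal Python sketch of the same idea. All the numbers (sample size, outcome distribution, distribution of individual treatment effects, number of repeated trials) are made up for illustration. Because it is a simulation, both potential outcomes are available for everyone, so the difference in observed group means can be compared with the true mean individual treatment effect.

```python
import random
import statistics


def simulate_trial(n, rng):
    """Simulate one RCT; return (difference in means, true mean individual effect)."""
    # Both potential outcomes exist for everyone here, which is only possible
    # in a simulation. All distribution parameters are arbitrary assumptions.
    y0 = [rng.gauss(50, 10) for _ in range(n)]   # outcome under control
    ite = [rng.gauss(5, 3) for _ in range(n)]    # individual treatment effect
    y1 = [a + b for a, b in zip(y0, ite)]        # outcome under treatment

    # Randomise exactly half of the participants to treatment
    treated = set(rng.sample(range(n), n // 2))

    # Only one potential outcome per participant is observed
    treat_obs = [y1[i] for i in range(n) if i in treated]
    ctrl_obs = [y0[i] for i in range(n) if i not in treated]

    estimate = statistics.mean(treat_obs) - statistics.mean(ctrl_obs)
    return estimate, statistics.mean(ite)


rng = random.Random(2024)
results = [simulate_trial(200, rng) for _ in range(2000)]
errors = [estimate - truth for estimate, truth in results]

mean_error = statistics.mean(errors)   # close to zero: unbiased on average
sd_error = statistics.stdev(errors)    # but any single trial can be some way off

print(f"Mean error across trials: {mean_error:.3f}")
print(f"SD of error across trials: {sd_error:.3f}")
```

The mean error across repeated trials is close to zero, which is what unbiasedness means. The spread of errors is not zero, which is why a single trial’s estimate should be reported with a confidence interval.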

On the parallel trends assumption in difference-in-differences (diff-in-diffs)

“The man who has fed the chicken every day throughout its life at last wrings its neck instead, showing that more refined views as to the uniformity of nature would have been useful to the chicken.”
        – Bertrand Russell (1912/2001, p. 35)

The parallel trends assumption of difference-in-differences (diff-in-diffs) is that the average outcomes for intervention and comparison groups would have continued in parallel from pre- to post-intervention if the intervention had not been introduced. This assumption cannot be directly tested, since when diff-in-diffs is used, the intervention is introduced. However, a case is often made that parallel trends probably holds (or doesn’t not hold) by analysing pre-intervention trends.

The mystery graph below shows an example from Kahn-Lang and Lang (2020, p. 618), redrawn to add some suspense:

The averages for the two groups (A and B) are practically identical and remain parallel. I can also reveal that there is a large number of observations – enough to be really confident that the lines are parallel. Given data like this, many of us would be confident that we had found no evidence against parallel trends.

Alas, once the time series is extended, we see that the averages significantly diverge. Adding titles to the graph reveals why – it shows median height by gender from ages 5 to 19:

Growth reference data from WHO; see percentiles for girls and boys xlsx files over here

Around age 9, girls’ median height begins to exceed boys’, the difference peaking at about 12 years old. Then the difference in medians decreases until around 13, when boys’ median height begins to exceed girls’.

Clearly, if we wanted to evaluate, e.g., an intervention to boost children’s height, we wouldn’t compare the mean height of one gender with another as control. The biological processes underpinning gender differences in the pubertal growth spurt are well known. However, diff-in-diffs is often applied in situations where much less is known about the dynamics of change over time.

As this example illustrates, the more we know about the topic under investigation and the more covariates we have at our disposal for choosing comparison units, the better our causal estimates using diff-in-diffs are likely to be. Diff-in-diffs can also be combined with matching or weighting on covariates to help construct a comparison group such that parallel trends is more likely to hold; see, e.g., Huntington-Klein (2022, Section 18.3.2).
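
To make the role of the assumption concrete, here is a minimal Python sketch of the basic two-group, two-period diff-in-diffs calculation. The group means, the trends, and the true effect are all made up, and the function name is hypothetical. In the second scenario the intervention group would have changed more than the comparison group even without the intervention, so the estimate absorbs that extra change.

```python
def diff_in_diffs(treat_pre, treat_post, comp_pre, comp_post):
    """Change in the intervention group minus change in the comparison group."""
    return (treat_post - treat_pre) - (comp_post - comp_pre)


true_effect = 2.0  # assumed effect of the intervention on the intervention group

# Scenario 1: parallel trends hold.
# Without the intervention, both groups would have risen by 3.
est_parallel = diff_in_diffs(
    treat_pre=10.0, treat_post=10.0 + 3.0 + true_effect,
    comp_pre=8.0, comp_post=8.0 + 3.0,
)

# Scenario 2: parallel trends fail.
# Without the intervention, the intervention group would have risen by 5,
# the comparison group by 3.
est_nonparallel = diff_in_diffs(
    treat_pre=10.0, treat_post=10.0 + 5.0 + true_effect,
    comp_pre=8.0, comp_post=8.0 + 3.0,
)

print(f"Estimate when trends are parallel: {est_parallel}")
print(f"Estimate when trends are not parallel: {est_nonparallel}")
```

The bias in the second scenario equals the gap between the two groups’ no-intervention trends. That gap is never observed directly, which is why pre-intervention trends and subject knowledge carry so much weight.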


Huntington-Klein, N. (2022). The effect: An introduction to research design and causality. CRC Press.

Kahn-Lang, A., & Lang, K. (2020). The Promise and Pitfalls of Differences-in-Differences: Reflections on 16 and Pregnant and Other Applications. Journal of Business & Economic Statistics, 38(3), 613–620.

Russell, B. (1912/2001). The problems of philosophy. Oxford University Press.

People still read blogs!

Thanks very much to Thomas Aston (2024) for critical engagement in the Evaluation journal:

“… there were somewhat more thoughtful debates on the integration of experiments with theory-based evaluation. Of course, this is not a new discussion, but it reemerged, as I discussed in Randomista mania (Aston, 2023c), during a Kantar Public (2023) (now Verian) webinar on ensuring rigor in theory-based evaluation. In the United Kingdom, the Magenta Book guidance from HM Treasury (2020) includes a decision tree which implies, to some readers, that experimental designs cannot be theory-based. During the event, Alex Hurrell pointed out that theory-based evaluation and experimental methods are not necessarily irreconcilable. To this end, Andi Fugard (2023) wrote a blog arguing that “counterfactual” is not synonymous with “control group” and later conducted a thoughtful webinar for the UK Evaluation Society (2023) on challenging the theory-based counterfactual binary. In my view, Fugard is right that there is not a strict binary which implies that counterfactual approaches should not be theory-based. They have been moving in that direction for years (White, 2009). But perhaps the decision tree is less about the benefits of integrating theory into counterfactual approaches and more about the epistemic, practical, and ethical limits of experimental impact evaluation approaches and the importance of exploring alternative options when they are neither possible nor appropriate.”

I wish I shared Thomas’s optimism that the theory-based/counterfactual binary is already blurring. My reading is that the original 1975 definition of theory-based evaluation was inclusive, and still is for those in the theory-driven camp (e.g., Huey Chen’s work). But in many UK evaluation contexts, theory-driven is synonymous with contribution analysis, qualitative comparative analysis, and process tracing, applied to qualitative data. RCTs and QEDs are not allowed. There are notable exceptions.