Carol Fitz-Gibbon (1938 – 2017), author of first description of theory-based evaluation, on importance of RCTs

“[…] I produced the first description of theory based evaluation […]. The point of theory based evaluation is to see, firstly, to what extent the theory is being implemented and, secondly, if the predicted outcomes then follow. It is particularly useful as an interim measure of implementation when the outcomes cannot be measured until much later. But most (if not all) theories in social science are only sets of persuasively stated hypotheses that provide a temporary source of guidance. In order to see if the hypotheses can become theories one must measure the extent to which the predicted outcomes are achieved. This requires randomised controlled trials. Even then the important point is to establish the direction and magnitude of the causal relation, not the theory. Many theories can often fit the same data.”

Fitz-Gibbon, C. T. (2002). Researching outcomes of educational interventions. BMJ, 324(7346), 1155.

It’s all theory-based and counterfactual

Two of my favourite articles on evaluation are Cook’s (2000) argument that all impact evaluations, RCTs included, are theory-based and Reichardt’s (2022) argument that there’s always a counterfactual, if not explicitly articulated then not far beneath the surface. I think both arguments are irrefutable, but building on their and others’ work to improve evaluation commissioning and delivery seems a formidable challenge given the fiercely defended dichotomies in the field.

If all impact evaluation really is theory-based then it’s clear there’s huge variation in the quality of theories and theorising. If all impact evaluation depends on counterfactuals then there is huge variation in how compelling the evidence is for the counterfactual outcomes, particularly when there is no obvious comparison group.

Clarifying these kinds of distinctions is, I think, important for improving evaluations and the public services and other programmes they evaluate.


Cook, T. D. (2000). The false choice between theory-based evaluation and experimentation. In A. Petrosino, P. J. Rogers, T. A. Huebner, & T. A. Hacsi (Eds.), New directions in evaluation: Program Theory in Evaluation: Challenges and Opportunities (pp. 27–34). Jossey-Bass.

Reichardt, C. S. (2022). The Counterfactual Definition of a Program Effect. American Journal of Evaluation, 43(2), 158–174.

A cynical view of SEMs

It is all too common for a box and arrow diagram to be cobbled together in an afternoon and christened a “theory of change”. One formalised version of such a diagram is a structural equation model (SEM), the arrows of which are annotated with coefficients estimated using data. Here is John Fox (2002) on SEM and informal boxology:

“A cynical view of SEMs is that their popularity in the social sciences reflects the legitimacy that the models appear to lend to causal interpretation of observational data, when in fact such interpretation is no less problematic than for other kinds of regression models applied to observational data. A more charitable interpretation is that SEMs are close to the kind of informal thinking about causal relationships that is common in social-science theorizing, and that, therefore, these models facilitate translating such theories into data analysis.”


Fox, J. (2002). Structural Equation Models: Appendix to An R and S-PLUS Companion to Applied Regression. Last corrected 2006.

Beautiful friendships have been jeopardised

This is an amusing opening to a paper on face validity, by Mosier (1947):

“Face validity is a term that is bandied about in the field of test construction until it seems about to become a part of accepted terminology. The frequency of its use and the emotional reaction which it arouses – ranging almost from contempt to highest approbation – make it desirable to examine its meaning more closely. When a single term variously conveys high praise or strong condemnation, one suspects either ambiguity of meaning or contradictory postulates among those using the term. The tendency has been, I believe, to assume unaccepted premises rather than ambiguity, and beautiful friendships have been jeopardized when a chance remark about face validity has classed the speaker among the infidels.”

I think dozens of beautiful friendships have been jeopardized by loose talk about randomised controlled trials, theory-based evaluation, realism, and positivism, among many others. I’ve just seen yet another piece arguing that you wouldn’t evaluate a parachute with an RCT and I can’t even.


Mosier, C. I. (1947). A Critical Examination of the Concepts of Face Validity. Educational and Psychological Measurement, 7(2), 191–205.

Counterfactual evaluation

Consider the following two sentences:

(1) Alex’s train left 2 minutes before they arrived at the platform.

(2) If Alex had arrived at the platform 10 minutes earlier, then they probably would have caught their train.

Is the counterfactual in sentence 2 true or false, or can’t you tell because you didn’t run an RCT?

I reckoned that the counterfactual is true. I reasoned that Alex probably missed the train because they were late, so turning up earlier would have fixed that.

I could think of other possible outcomes, but they became increasingly contrived and far from the (albeit minimal) evidence provided. For instance, it is conceivable that if Alex arrived earlier, they would have believed they had time to pop to Pret for a coffee – and missed the train again.

Applying process tracing to RCTs

Process tracing is an application of Bayes’ theorem to test hypotheses using qualitative evidence.¹ Application areas tend to be complex, e.g., evaluating the outcomes of international aid or determining the causes of a war by interpreting testimony and documents. This post explores what happens if we apply process tracing to a simple hypothetical quantitative study: an RCT that includes a mediation analysis.

Process tracing is often conducted without probabilities, using heuristics such as the “hoop test” or “smoking gun test” that make its Bayesian foundations digestible. Alternatively, probabilities may be made easier to digest by viewing them through verbal descriptors such as those provided by the PHIA Probability Yardstick. Given the simple example we will tackle, I will apply Bayes’ rule directly to point probabilities.
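To see why heuristics like the hoop test and the smoking gun test sit comfortably on top of Bayes’ rule, here is a minimal sketch. All the probabilities are illustrative assumptions of mine rather than values from the process tracing literature, and the function name `posterior` is just a label for this example.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h, observed):
    """Bayes' rule for one hypothesis H (versus not-H) and one binary clue E."""
    if not observed:
        # Condition on the clue being absent instead of present.
        p_e_given_h = 1 - p_e_given_h
        p_e_given_not_h = 1 - p_e_given_not_h
    numerator = prior * p_e_given_h
    return numerator / (numerator + (1 - prior) * p_e_given_not_h)


PRIOR = 0.5  # illustrative starting belief in H

# Hoop test (assumed values): the clue is almost certain if H is true,
# but also fairly likely if H is false.
hoop_pass = posterior(PRIOR, 0.95, 0.60, observed=True)
hoop_fail = posterior(PRIOR, 0.95, 0.60, observed=False)

# Smoking gun test (assumed values): the clue is very unlikely if H is
# false, but not especially likely even if H is true.
gun_found = posterior(PRIOR, 0.30, 0.02, observed=True)
gun_missing = posterior(PRIOR, 0.30, 0.02, observed=False)

print(f"Hoop test passed:     P(H) = {hoop_pass:.2f}")
print(f"Hoop test failed:     P(H) = {hoop_fail:.2f}")
print(f"Smoking gun found:    P(H) = {gun_found:.2f}")
print(f"Smoking gun missing:  P(H) = {gun_missing:.2f}")
```

With these made-up numbers, failing the hoop test hurts the hypothesis far more than passing helps it, while finding the smoking gun helps far more than its absence hurts. That asymmetry is all the two heuristics are encoding.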

I will assume that there are three mutually exclusive hypotheses:

Null: the intervention has no effect.

Out: the intervention improves outcomes; however, not through the hypothesised mediator (it works but we have no idea how).

Med: the intervention improves the outcome and it does so through the hypothesised mediator.

Other hypotheses I might have included are that the intervention causes harm or that the mediator operates in the opposite direction to that hypothesised. We might also be interested in whether the intervention pushes the mediator in the desired direction without shifting the outcome. But let’s not overcomplicate things.

There are two sources of evidence, estimates of:

Average treatment effect (ATE): I will treat this evidence source as binary: whether or not there is a statistically significant difference between treatment and control (alternative versus null hypothesis). Let’s suppose that the Type I error rate is 5% and power is 80%. This means that if either Out or Med holds, then there is an 80% chance of obtaining a statistically significant effect. If neither holds, then there is a 5% chance of obtaining a statistically significant effect (in error).

Average causal mediation effect (ACME): I will again treat this as binary: is ACME statistically significantly different to zero or not (alternative versus null hypothesis). I will assume that if ATE is significant and Med holds, then there is a 70% chance that ACME will be significant. Otherwise, I will assume a 5% chance (by Type I error).

Note where I obtained the probabilities above. I got the 5% and 80% for free, following conventions for Type I error and power in the social sciences. I arrived at the 70% using finger-in-the-wind: it should be possible to choose a decent mediator based on the prior literature, I reasoned; however, I have seen examples where a reasonable choice of mediator still fails to operate as expected in a highly powered study.

Finally, I need to choose prior probabilities for Null, Out, and Med. Under clinical equipoise, I feel that there should be a 50-50 chance of the intervention having an effect or not (findings from prior studies of the same intervention notwithstanding). Now suppose it does have an effect. I am going to assume there is a 50% chance of that effect operating through the mediator.

This means that

P(Null) = 50%
P(Out) = 25%
P(Med) = 25%

So, P(Out or Med) = 50%, i.e., the prior probabilities are set up to reflect my belief that there is a 50% chance the intervention works somehow.

I’m going to use a Bayesian network to do the sums for me (I used GeNIe Modeler). Here’s the setup:

The lefthand node shows the prior probabilities, as chosen. The righthand nodes show the inferred probabilities of observing the different patterns of evidence.

Let’s now pretend we have concluded the study and observed evidence. Firstly, we are delighted to discover that there is a statistically significant effect of the intervention on outcomes. Let’s update our Bayesian network (note how the Alternative outcome on ATE has been underlined and emboldened):

P(Null) has now dropped to 6% and P(ACME > 0) has risen to 36%. We do not yet have sufficient evidence to distinguish between Out and Med: their probabilities are both 47%.²

Next, let’s run the mediation analysis. It is also statistically significant:

So, given our initial probability assignments and the pretend evidence observed, we can be 93% sure that the intervention works and does so through the mediator.

If the mediation test had not been statistically significant, then P(Out) would have risen to 69% and P(Med) would have dropped to 22%. If the ATE had been indistinguishable from zero, then P(Null) would have been 83%.
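For readers without Bayesian network software, the same sums can be done by brute-force enumeration. The sketch below is my own plain-Python reconstruction, not the GeNIe model itself. It assumes the priors and likelihoods stated above, and it reads “otherwise 5%” as also covering the case where the ATE is not significant. The function names are made up for this example.

```python
PRIOR = {"Null": 0.50, "Out": 0.25, "Med": 0.25}
P_ATE_SIG = {"Null": 0.05, "Out": 0.80, "Med": 0.80}  # Type I error, power


def p_acme_sig(hyp, ate_sig):
    # 70% only when Med holds and the ATE is significant; otherwise 5%.
    return 0.70 if (hyp == "Med" and ate_sig) else 0.05


def joint(hyp, ate_sig, acme_sig):
    """P(hypothesis, ATE result, ACME result)."""
    p_ate = P_ATE_SIG[hyp] if ate_sig else 1 - P_ATE_SIG[hyp]
    p_acme = p_acme_sig(hyp, ate_sig)
    if not acme_sig:
        p_acme = 1 - p_acme
    return PRIOR[hyp] * p_ate * p_acme


def posterior(ate_sig=None, acme_sig=None):
    """P(hypothesis | evidence). None means that result is not yet observed."""
    weights = {}
    for hyp in PRIOR:
        weights[hyp] = sum(
            joint(hyp, a, m)
            for a in (True, False)
            for m in (True, False)
            if (ate_sig is None or a == ate_sig)
            and (acme_sig is None or m == acme_sig)
        )
    total = sum(weights.values())
    return {hyp: w / total for hyp, w in weights.items()}


def p_acme_sig_given_ate_sig():
    """Predictive probability of a significant ACME once the ATE is significant."""
    numerator = sum(joint(h, True, True) for h in PRIOR)
    denominator = sum(joint(h, True, m) for h in PRIOR for m in (True, False))
    return numerator / denominator


scenarios = {
    "ATE significant": posterior(ate_sig=True),
    "ATE and ACME significant": posterior(ate_sig=True, acme_sig=True),
    "ATE significant, ACME not": posterior(ate_sig=True, acme_sig=False),
    "ATE not significant": posterior(ate_sig=False),
}
for label, post in scenarios.items():
    summary = ", ".join(f"P({h}) = {p:.0%}" for h, p in post.items())
    print(f"{label}: {summary}")
print(f"P(ACME significant | ATE significant) = {p_acme_sig_given_ate_sig():.0%}")
```

Under those assumptions the enumeration reproduces the percentages quoted in the text: 6%, 47% and 36% after a significant ATE, 93% for Med once the mediation test is also significant, 69% and 22% if it is not, and 83% for Null if the ATE is not significant.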

Is this process tracing or simply putting Bayes’ rule to work as usual? Does this example show that RCTs can be theory-based evaluations, since process tracing is a theory-based method, or does the inclusion of a control group rule out that possibility, as Figure 3.1 of the Magenta Book would suggest? I will leave the reader to assign probabilities to each possible conclusion. Let me know what you think.

¹ Okay, I accept that it is controversial to say that process tracing is necessarily an application of Bayes, particularly when no sums are involved. However, to me Bayes’ rule explains in the simplest possible terms why the four tests attributed to Van Evera (1997) [Guide to Methods for Students of Political Science. Ithaca, NY: Cornell University Press.] work. It’s clear why there are so many references to Bayes in the process tracing literature.

² These are all actually conditional probabilities. I have made this implicit in the notation for ease of reading. Hopefully all is clear given the prose.

For example, P(Hyp = Med | ATE = Alternative) = 47%; in other words, the probability of Med given a statistically significant ATE estimate is 47%.

Special issue dedicated to John Mayne

‘I am honoured to introduce this special issue dedicated to John Mayne, a “thought leader,” “practical thinker,” “bridge builder,” and “scholar practitioner” in the field of evaluation. Guest editors Steffen Bohni Nielsen, Sebastian Lemire, and Steve Montague bring together 14 colleagues whose articles document, analyze, and expand on John’s contributions to evaluation in the Canadian public service as well as his contributions to evaluation theory.’ –Jill A. Chouinard

Canadian Journal of Program Evaluation, Volume 37 Issue 3, March 2023

Theory-based vs. theory-driven evaluation

“Donaldson and Lipsey (2006), Leeuw and Donaldson (2015), and Weiss (1997) noted that there is a great deal of confusion today about what is meant by theory-based or theory-driven evaluation, and the differences between using program theory and social science theory to guide evaluation efforts. For example, the newcomer to evaluation typically has a very difficult time sorting through a number of closely related or sometimes interchangeable terms such as theory-oriented evaluation, theory-based evaluation, theory-driven evaluation, program theory evaluation, intervening mechanism evaluation, theoretically relevant evaluation research, program theory, program logic, logic modeling, logframes, systems maps, and the like. Rather than trying to sort out this confusion, or attempt to define all of these terms and develop a new nomenclature, a rather broad definition is offered in this book in an attempt to be inclusive.

“Program Theory–Driven Evaluation Science is the systematic use of substantive knowledge about the phenomena under investigation and scientific methods to improve, to produce knowledge and feedback about, and to determine the merit, worth, and significance of evaluands such as social, educational, health, community, and organizational programs.”

– Donaldson, S. I. (2022, p. 9). Introduction to Theory-Driven Program Evaluation (2nd ed.). Routledge.

My vote is for program/programme evaluation 😉

Seven ways to estimate a counterfactual

Experimental and quasi-experimental evaluations usually define a programme effect as the difference between (a) the actual outcome following a social programme and (b) an estimate of what the outcome would have been without the programme – the counterfactual outcome. (The latter might be a competing programme or some genre of “business as usual”.)

It is also usually argued that qualitative or so-called “theory-based” approaches to evaluation are not counterfactual evaluations. Reichardt (2022) adds to a slowly accumulating body of work that challenges this and argues that any approach to evaluation can be understood in counterfactual terms.

Reichardt provides seven examples of evaluation approaches, quantitative and qualitative, and explains how a counterfactual analysis is relevant:

  1. Comparisons Across Participants. RCTs and friends. The comparison group is used to estimate the counterfactual. (Note: the comparison group is not the counterfactual. A comparison group is factual.)
  2. Before-After Comparisons. The baseline score is often treated as the counterfactual outcome (though it probably is not, owing to, e.g., regression to the mean).
  3. What-If Assessments. Asking participants to reflect on a counterfactual like, “How would you have felt without the programme?” Participants provide the estimate of the counterfactual, the evaluators use it to estimate the effect.
  4. Just-Tell-Me Assessments. Cites Copestake (2014): “If we are interested in finding out whether particular men, women or children are less hungry as a result of some action it seems common-sense just to ask them.” In this case participants may be construed as carrying out the “What-If” assessment of the previous point and using this to work out the programme effect themselves.
  5. Direct Observation. Simply seeing the causal effect rather than inferring. An example given is of tapping a car brake and seeing the effect. Not sure I buy this one and neither does Reichardt. Whatever it is, I agree a counterfactual of some sort is needed (and inferred): you need to have a theory to explain what would have happened had you not tapped the brake.
  6. Theories-of-Change Assessments. Contribution analysis and realist evaluation are offered as examples. The gist is that, despite what proponents of these approaches claim, using a theory of change to work out whether the programme is responsible for or “contributes to” outcomes requires using that theory to think about the counterfactual. I’ve blogged elsewhere about realist evaluation and contribution analysis and their definitions of a causal effect.
  7. The Modus Operandi (MO) Method. The evaluator looks for evidence of traces or tell-tales that the programme worked. Not sure I quite get how this differs from theory-of-change assessments. Maybe it doesn’t. It sounds like potentially another way to evidence the causal chains in a theory of change.

The conclusion:

“I suspect there is no viable alternative to the counterfactual definition of an effect and that when the counterfactual definition is not given explicitly, it is being used implicitly. […] Of course, evaluators are free to use an alternative to the counterfactual definition of a program effect, if an adequate alternative can be found. But if an alternative definition is used, evaluators should explicitly describe that alternative definition and forthrightly demonstrate how their definition undergirds their methodology […].”

I like four of the seven, as kinds of evidence used to infer the counterfactual outcome. I also propose a fifth: evaluator opinion.

  1. Comparisons Across Participants.
  2. Before-After Comparisons.
  3. What-If Assessments.
  4. Just-Tell-Me Assessments.
  5. Evaluator opinion.

The What-If and Just-Tell-Me assessments could involve subject experts rather than only beneficiaries of a programme, which would affect how those assessments are interpreted, particularly if the experts have a vested interest. To me, the Theories-of-Change Assessments in Reichardt’s original could be carried out with the help of one or more of these five. They are all ways to justify causal links (mediating or intermediate variables), not just to evaluate outcomes, and they help assess the validity of a theory of change. Readers may not find them all equally compelling, though, particularly the last.
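To make the contrast between the first two items concrete, here is a small simulation of a hypothetical programme that does nothing at all. Everything in it is an assumption of mine for illustration: the score distributions, the enrolment threshold of 40, the sample size, and the zero true effect. People are enrolled because they scored low at baseline, so their scores drift back up at follow-up through regression to the mean alone.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is reproducible

N = 20_000
TRUE_EFFECT = 0.0  # hypothetical programme with no effect whatsoever

# Each person has a stable underlying level plus independent noise
# every time they are measured.
people = []
for _ in range(N):
    level = rng.gauss(50, 10)
    baseline = level + rng.gauss(0, 10)
    people.append((level, baseline))

# Enrol only those with a low baseline score.
enrolled = [(level, base) for level, base in people if base < 40]

# Randomise the enrolled people to treatment or control, then measure
# everyone again at follow-up.
treated, control = [], []
for level, base in enrolled:
    followup_noise = rng.gauss(0, 10)
    if rng.random() < 0.5:
        treated.append((base, level + TRUE_EFFECT + followup_noise))
    else:
        control.append((base, level + followup_noise))

# Before-after: baseline stands in for the counterfactual outcome.
before_after = mean(follow - base for base, follow in treated)

# Across participants: the control group's follow-up mean is the
# estimate of the counterfactual outcome.
across_groups = mean(f for _, f in treated) - mean(f for _, f in control)

print(f"Before-after estimate of effect:  {before_after:.1f}")
print(f"Across-groups estimate of effect: {across_groups:.1f}")
```

In this setup the before-after comparison credits the programme with a sizeable improvement it did not cause, while the randomised comparison lands close to zero. Both are estimates of the same counterfactual outcome; they differ in how compelling the evidence for it is.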


Copestake, J. (2014). Credible impact evaluation in complex contexts: Confirmatory and exploratory approaches. Evaluation, 20(4), 412–427.

Reichardt, C. S. (2022). The Counterfactual Definition of a Program Effect. American Journal of Evaluation, 43(2), 158–174.