Challenging the theory-based/counterfactual binary

Randomised controlled trials (RCTs) are commonly presented as an instance of the counterfactual kind of impact evaluation, and contribution analysis as an instance of the theory-based kind. For example, the UK’s Magenta Book recommends using theory-based evaluation if you can’t find a comparison group (HM Treasury, 2020, p. 47). The founding texts on contribution analysis present it as using a non-counterfactual approach to causation (e.g., Mayne, 2019, pp. 173–4). In December 2023, I gave a talk to the UK Evaluation Society (UKES) exploring what happens if we challenge this theory-based/counterfactual binary. This post summarises what I said.

Good experiments and quasi-experiments are theory-based

Theory is required to select the variables used in RCTs and quasi-experimental designs (QEDs); that is, to decide what data to gather and include in analyses. We need to peek inside the “black box” of a programme to work out what we are evaluating and how. These variables include outcomes, moderators, mediators, competing exposures, and, in QEDs, confounders.

This diagram denotes a causal directed acyclic graph (causal DAG). Therapy is conjectured to lead to behavioural activation, which alleviates depression. Supportive friends are a “competing exposure” that can alleviate depression. Reading self-help texts is a confounder as it predicts both whether someone seeks therapy and their outcomes, so this (made up) model is of a quasi-experimental evaluation.
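As a rough sketch, the same (made up) model can be written down as a list of edges and the role of each variable read off from the graph structure. The node names and the simplified role definitions in `classify` are my own assumptions for illustration, not a standard library API.

```python
# Edges of the (made up) causal DAG described above, written as (cause, effect).
# Node names are illustrative labels chosen for this sketch.
EDGES = {
    ("therapy", "behavioural_activation"),
    ("behavioural_activation", "depression"),
    ("supportive_friends", "depression"),
    ("self_help_texts", "therapy"),
    ("self_help_texts", "depression"),
}


def children(node):
    """Direct effects of `node`."""
    return {effect for cause, effect in EDGES if cause == node}


def descendants(node, blocked=frozenset()):
    """All nodes reachable from `node` along directed paths that avoid `blocked`."""
    found, stack = set(), [node]
    while stack:
        for child in children(stack.pop()) - set(blocked):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def classify(exposure, outcome):
    """Label every other variable using simplified, assumed role definitions."""
    roles = {}
    nodes = {n for edge in EDGES for n in edge} - {exposure, outcome}
    for node in sorted(nodes):
        causes_outcome = outcome in descendants(node)
        if node in descendants(exposure) and causes_outcome:
            # On a causal path from exposure to outcome.
            roles[node] = "mediator"
        elif (exposure in descendants(node)
              and outcome in descendants(node, blocked={exposure})):
            # Causes the exposure and also reaches the outcome by another route.
            roles[node] = "confounder"
        elif causes_outcome:
            # Causes the outcome but is unrelated to the exposure.
            roles[node] = "competing exposure"
        else:
            roles[node] = "other"
    return roles


ROLES = classify("therapy", "depression")
print(ROLES)
```

The point of the sketch is that the roles come from the theory encoded in `EDGES`, not from any data: change an arrow and the same variable can switch from confounder to mediator.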

In QEDs, many different model structures are compatible with the data. The slogan “correlation doesn’t imply causation” can be reformulated as: “A causes B” and “B causes A” are Markov equivalent if all we know is that A and B are correlated. The number of models that are Markov equivalent rises exponentially as the number of variables and statistical associations increases. Theory is needed to select between the models: the data alone cannot tell us.
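A small sketch of the two-variable case, using made-up probabilities: start from an “A causes B” model, derive the parameters of a “B causes A” model using Bayes’ rule, and check that both imply exactly the same joint distribution. The parameter values and variable names are assumptions for illustration only.

```python
from fractions import Fraction as F

# Model 1, "A causes B": made-up parameters P(A=1) and P(B=1 | A).
P_A1 = F(1, 2)
P_B1_GIVEN_A = {0: F(1, 5), 1: F(4, 5)}


def bern(p_one, value):
    """Probability of `value` (0 or 1) when P(value = 1) is `p_one`."""
    return p_one if value == 1 else 1 - p_one


joint_a_causes_b = {
    (a, b): bern(P_A1, a) * bern(P_B1_GIVEN_A[a], b)
    for a in (0, 1) for b in (0, 1)
}

# Model 2, "B causes A": parameters derived from Model 1 using Bayes' rule.
P_B1 = sum(p for (a, b), p in joint_a_causes_b.items() if b == 1)
P_A1_GIVEN_B = {
    b: joint_a_causes_b[(1, b)] / bern(P_B1, b) for b in (0, 1)
}

joint_b_causes_a = {
    (a, b): bern(P_B1, b) * bern(P_A1_GIVEN_B[b], a)
    for a in (0, 1) for b in (0, 1)
}

# Both causal stories predict identical observational data.
SAME_DISTRIBUTION = joint_a_causes_b == joint_b_causes_a
print(SAME_DISTRIBUTION)  # True
```

No amount of observational data on A and B alone can separate the two models here, which is why something beyond the data, such as theory or an intervention, is needed.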

Although RCTs give us an unbiased estimate of an average treatment effect, that is, the average of each individual’s unmeasurable treatment effect (difference between measured actual outcome and counterfactual outcome), they cannot tell us what that difference represents. To do that, we need a theory of the ingredients and processes in the two programmes being compared; for example, what are the similarities and differences between CBT and humanistic counselling or whatever “treatment as usual” is in practice.
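To illustrate the “unbiased estimate” part of this claim, here is a minimal sketch with made-up potential outcomes for six hypothetical people. It enumerates every way of randomising three of them to treatment and checks that the difference in means, averaged over all possible randomisations, equals the average of the individual treatment effects. The numbers and names are assumptions for illustration.

```python
from fractions import Fraction
from itertools import combinations
from statistics import mean

# Made-up potential outcomes for six people: (outcome if treated, outcome if not).
# In a real trial only one value from each pair is ever observed.
POTENTIAL = [(12, 10), (9, 9), (15, 11), (8, 10), (14, 9), (11, 8)]
PEOPLE = range(len(POTENTIAL))

# The average of each individual's (unobservable) treatment effect.
TRUE_ATE = mean(Fraction(y1 - y0) for y1, y0 in POTENTIAL)


def difference_in_means(treated):
    """Estimate from one possible trial: treated mean minus control mean."""
    control = [i for i in PEOPLE if i not in treated]
    treated_mean = mean(Fraction(POTENTIAL[i][0]) for i in treated)
    control_mean = mean(Fraction(POTENTIAL[i][1]) for i in control)
    return treated_mean - control_mean


# Every way of randomly assigning three of the six people to treatment.
ESTIMATES = [difference_in_means(t) for t in combinations(PEOPLE, 3)]

# Individual trials vary, but on average they recover the true ATE.
MEAN_ESTIMATE = mean(ESTIMATES)
print(TRUE_ATE, MEAN_ESTIMATE, min(ESTIMATES), max(ESTIMATES))
```

Note what the sketch does not contain: anything about what the “treated” and “untreated” conditions consist of. That part has to come from theory.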

Counterfactual evaluations do not need a comparison group

There is a long history of research on counterfactual reasoning in the absence of a control group. How should we determine the truth of a counterfactual such as “If Oswald had not killed Kennedy, no one else would have” (e.g., Adams, 1970)? Clearly we ponder statements like these without running violent RCTs. Another strand of research investigates how children, adults, and animals perform counterfactual reasoning in practice (e.g., Rafetseder et al., 2010). This research on non-experimental counterfactual reasoning includes literature in evaluation, for instance by White (2010) and more recently Reichardt (2022).

Halpern (2016) introduces a formal framework for defining causal relationships and estimating counterfactual outcomes, regardless of the sources of evidence. The diagram below illustrates the simplest version of the framework where causes and effects are binary. Equations, annotating each node, define the causal relationships and allow counterfactual outcomes to be inferred.

This is one of the models I used to illustrate counterfactual inference without a comparison group. In the factual situation, Alex was feeling down, spoke to their counsellor, and then felt better. In this counterfactual scenario, we are exploring what would have happened if Alex hadn’t spoken to their counsellor. In this case, they would have spoken to their friend instead and again felt better. Although there is no difference between the factual and counterfactual outcome, Halpern’s framework allows us to infer that Alex speaking to the counsellor was an actual cause of Alex feeling better.
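The scenario above can be sketched as a handful of binary structural equations. The equations below are my reconstruction from the prose description, so treat them as assumptions. The final check is a simplified version of the idea behind Halpern’s definitions (hold other variables at their factual values, then intervene) and leaves out conditions such as minimality.

```python
def solve(feeling_down, do=None):
    """Evaluate the structural equations, overriding any variables listed in `do`."""
    do = do or {}
    values = {"feeling_down": feeling_down}
    # Assumed equation: Alex speaks to the counsellor whenever feeling down.
    values["counsellor"] = do.get("counsellor", feeling_down)
    # Assumed equation: the friend is a backup, used only if there is no counselling.
    values["friend"] = do.get("friend", feeling_down and not values["counsellor"])
    # Assumed equation: either conversation is enough for Alex to feel better.
    values["better"] = do.get("better", values["counsellor"] or values["friend"])
    return values


# What actually happened.
FACTUAL = solve(feeling_down=True)

# What would have happened had Alex not spoken to the counsellor.
COUNTERFACTUAL = solve(feeling_down=True, do={"counsellor": False})

# Same intervention, but holding the friend variable at its factual value.
CONTINGENCY = solve(
    feeling_down=True,
    do={"counsellor": False, "friend": FACTUAL["friend"]},
)

# Simplified actual-cause check: the outcome changes under the contingency.
IS_ACTUAL_CAUSE = (
    FACTUAL["counsellor"] and FACTUAL["better"] and not CONTINGENCY["better"]
)

print(FACTUAL["better"], COUNTERFACTUAL["better"], IS_ACTUAL_CAUSE)  # True True True
```

The factual and counterfactual outcomes are the same, so a simple “but for” test would miss the counsellor’s contribution. It only shows up once the backup route via the friend is held fixed.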

This provides a concrete illustration of why RCTs and QEDs are unnecessary to reason counterfactually. However, the causal model must be correct for the correct conclusions to be drawn, so one might reasonably ask what sorts of evidence we can use to build these models and persuade people they are true. There is also hearty debate in the literature about what constitutes an “actual cause”, and algorithms for determining this are fiddly to apply, even assuming we have a true model of the causal relationships. I explored Halpern’s approach in a previous blog post.

Summary

The original definition of theory-based evaluation by Fitz-Gibbon and Morris (1975) included the full range of approaches, qualitative and quantitative – including RCTs and QEDs. Many others follow in the tradition. For instance, Weiss (1997, p. 512) cites path analysis (a kind of structural equation model) as being “conceptually compatible with TBE [theory-based evaluation] and has been used by evaluators”. Chen and Rossi (1983) explain how theory-based (what they term theory-driven) RCTs that include well-chosen covariates (competing exposures) yield more precise estimates of effects (reducing the probability of Type II error), even though those covariates are not needed to control Type I error.

Counterfactual queries do not need a comparison group. They do need a model of how facts came about that can be modified to predict the counterfactual outcome.
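The precision point can be illustrated with a small simulation. This is a sketch under made-up assumptions (a binary competing exposure, a constant treatment effect, and adjustment by simple stratification), not a reproduction of any analysis in the cited texts. It repeats a hypothetical trial many times and compares how much the estimates vary with and without the covariate.

```python
import random
from statistics import mean, stdev

random.seed(1)

N = 100             # people per simulated trial (hypothetical)
EFFECT = 1.0        # true treatment effect, assumed identical for everyone
FRIEND_BOOST = 4.0  # assumed effect of the competing exposure on the outcome
# First half have supportive friends, second half do not.
FRIENDS = [1] * (N // 2) + [0] * (N // 2)


def diff(rows):
    """Treated mean minus control mean; assumes both groups are non-empty."""
    return (mean([y for t, _, y in rows if t == 1])
            - mean([y for t, _, y in rows if t == 0]))


def one_trial():
    """Randomise half the people to treatment and return both estimates."""
    treated = set(random.sample(range(N), N // 2))
    rows = []
    for i in range(N):
        t = 1 if i in treated else 0
        y = EFFECT * t + FRIEND_BOOST * FRIENDS[i] + random.gauss(0, 1)
        rows.append((t, FRIENDS[i], y))
    unadjusted = diff(rows)
    # Adjusted: estimate within each stratum, then average the equal-sized strata.
    adjusted = mean([diff([r for r in rows if r[1] == f]) for f in (0, 1)])
    return unadjusted, adjusted


RESULTS = [one_trial() for _ in range(500)]
UNADJUSTED = [u for u, _ in RESULTS]
ADJUSTED = [a for _, a in RESULTS]

# Both estimators centre on the true effect; the adjusted one varies less.
print(mean(UNADJUSTED), stdev(UNADJUSTED))
print(mean(ADJUSTED), stdev(ADJUSTED))
```

Randomisation keeps both estimators centred on the true effect, which is why the covariate is not needed to avoid bias. Knowing from theory which covariate to measure is what buys the extra precision.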

Challenging the theory-based/counterfactual binary does not mean that all evaluations are the same. There can still obviously be variation in the strength of evidence used to develop and test theories and how well theories withstand those tests. However, taking a more nuanced view of the differences and similarities between approaches leads to better evaluations.

(If you found this post interesting, please do say hello and let me know!)

References

Adams, E. W. (1970). Subjunctive and Indicative Conditionals. Foundations of Language, 6, 89–94.

Chen, H.-T., & Rossi, P. H. (1983). Evaluating With Sense: The Theory-Driven Approach. Evaluation Review, 7(3), 283–302.

Fitz-Gibbon, C. T., & Morris, L. L. (1975). Theory-based evaluation. Evaluation Comment, 5(1), 1–4. Reprinted in Fitz-Gibbon, C. T., & Morris, L. L. (1996). Theory-based evaluation. Evaluation Practice, 17(2), 177–184.

Halpern, J. Y. (2016). Actual causality. The MIT Press.

HM Treasury. (2020). Magenta Book.

Mayne, J. (2019). Revisiting contribution analysis. Canadian Journal of Program Evaluation, 34(2), 171–191.

Rafetseder, E., Cristi‐Vargas, R., & Perner, J. (2010). Counterfactual reasoning: Developing a sense of “nearest possible world”. Child Development, 81(1), 376–389.

Reichardt, C. S. (2022). The Counterfactual Definition of a Program Effect. American Journal of Evaluation, 43(2), 158–174.

Weiss, C. H. (1997). How can theory-based evaluation make greater headway? Evaluation Review, 21(4), 501–524.

White, H. (2010). A contribution to current debates in impact evaluation. Evaluation, 16(2), 153–164.

Reclaiming the term “theory-based”

Excited to discover a trio of 2024 publications that use the broader conception of theory-based evaluation that includes trials and quasi-experiments, as the term was introduced by Fitz-Gibbon and Morris (1975) and used by, e.g., Chen and Rossi (1980), Coryn et al. (2011), Weiss (1997), Funnell and Rogers (2011), Chen (2015), and many others.

I hope the next Magenta Book update has a more nuanced approach that includes RCTs and QEDs under theory-based, alongside, e.g., QCA and uses of Bayes’ rule to reason about qualitative evidence.

The new:

Bonell, C., Melendez-Torres, G. J., & Warren, E. (2024). Realist trials and systematic reviews: Rigorous, useful evidence to inform health policy. Cambridge University Press.

Matta, C., Lindvall, J., & Ryve, A. (2024). The Mechanistic Rewards of Data and Theory Integration for Theory-Based Evaluation. American Journal of Evaluation, 45(1), 110–132.

Schmidt, R. (2024). A graphical method for causal program attribution in theory-based evaluation. Evaluation, online first.

Key older texts:

Chen, H.-T., & Rossi, P. H. (1980). The Multi-Goal, Theory-Driven Approach to Evaluation: A Model Linking Basic and Applied Social Science. Social Forces, 59, 106–122.

Chen, H.-T., & Rossi, P. H. (1983). Evaluating With Sense: The Theory-Driven Approach. Evaluation Review, 7(3), 283–302.

Chen, H. T. (2015). Practical program evaluation: Theory-driven evaluation and the integrated evaluation perspective (2nd edition). Sage Publications.

Coryn, C. L. S., Noakes, L. A., Westine, C. D., & Schröter, D. C. (2011). A systematic review of theory-driven evaluation practice from 1990 to 2009. American Journal of Evaluation, 32(2), 199–226.

Fitz-Gibbon, C. T., & Morris, L. L. (1975). Theory-based evaluation. Evaluation Comment, 5(1), 1–4. Reprinted in Fitz-Gibbon, C. T., & Morris, L. L. (1996). Theory-based evaluation. Evaluation Practice, 17(2), 177–184.

Funnell, S. C., & Rogers, P. J. (2011). Purposeful Program Theory: Effective Use of Theories of Change and Logic Models. Jossey-Bass.

Weiss, C. H. (1997). How can theory-based evaluation make greater headway? Evaluation Review, 21(4), 501–524.

TBE QEDs

‘In TBE [theory-based evaluation] practice […] theory as represented is not specific enough to support causal conclusions in inference […]. For example, in contribution analysis “causal assumptions” refer to a “causal package” consisting of the program intervention and a set of contextual conditions that together may explain an observed change in the outcome […]. In realist evaluation, the causal mechanisms that are triggered by the intervention are specified in “configuration” with their context and the outcome. Often, however, the causal structure of the configuration is not clear […]. Moreover, the main TBE approaches to inference do not have standard practices, conventions, for treating bias in evidence […].

‘TBE practitioners may borrow from other methods to test theoretical assumptions […]. Sometimes TBE employs regression analysis or quasi-experimental propensity score matching in inference (our running example in this article of an actual TBE program evaluation does so).’

Schmidt, R. (2024). A graphical method for causal program attribution in theory-based evaluation. Evaluation, online first.

Ignorance of history in evaluation

“Despite occasional statements that program theory is a new approach, its roots go back more than fifty years. […] The history of program theory evaluation is not one of a steady increase in understanding. Instead, many of the key ideas have been well articulated and then ignored or forgotten in descriptions of the approach. It is not unusual to have statements that demonstrate a lack of knowledge of previous empirical and theoretical developments, such as a call for proposals from the Agency for Healthcare Research and Quality (2008) that claimed that “‘theory-based evaluation’ is a relatively new approach” (p. 14).”

Sue C. Funnell and Patricia J. Rogers (2011, pp. 15–16). Purposeful Program Theory: Effective Use of Theories of Change and Logic Models. Jossey-Bass.

What works for whom and in what contexts

“It is sometimes argued that we need rich qualitative data in order to find out not ‘what works’ but for whom and in what contexts. Anyone familiar with the design of experiments will agree. That question is answered by factorial designs with interaction terms… although main effects predominate in most educational datasets. If that last sentence seems like an unfamiliar concept, blame whoever taught you research methods and please make sure that your students are familiar with procedures that will underpin the thousands of controlled trials that education needs if it is to know that rather than simply guessing or asserting why.”

– Carol Taylor Fitz-Gibbon (2002). Knowing Why and Knowing That. Paper presented to the European Evaluation Conference in Seville, Spain. Emphasis and ellipsis in original.
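For readers less familiar with factorial designs, here is a minimal sketch of how an interaction term speaks to “for whom and in what contexts”. The cell means are made up, and the two contexts are hypothetical placeholders.

```python
# Hypothetical mean outcomes from a 2x2 factorial trial (made-up numbers).
# Keys are (intervention, context): 0 = absent / context A, 1 = present / context B.
CELL_MEANS = {
    (0, 0): 50.0, (1, 0): 58.0,  # context A: control, intervention
    (0, 1): 52.0, (1, 1): 54.0,  # context B: control, intervention
}


def simple_effect(context):
    """Effect of the intervention within one context."""
    return CELL_MEANS[(1, context)] - CELL_MEANS[(0, context)]


# "What works" on average across the two contexts.
MAIN_EFFECT = (simple_effect(0) + simple_effect(1)) / 2

# "In what contexts": how much the effect differs between contexts.
INTERACTION = simple_effect(1) - simple_effect(0)

print(MAIN_EFFECT, INTERACTION)  # 5.0 -6.0
```

With these made-up numbers the intervention helps in both contexts, but by 8 points in context A and only 2 in context B. A trial reporting only the main effect of 5 would hide that difference.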

The first definition of something named “theory-based evaluation”

“A theory-based evaluation of a program is one in which the selection of program features to evaluate is determined by an explicit conceptualization of the program in terms of a theory […] which attempts to explain how the program produces the desired effects. The theory might be psychological […] or social psychological […] or philosophical […]. The essential characteristic is that the theory points out a causal relationship between a process A and an outcome B.”

– Carol Taylor Fitz-Gibbon and Lynn Lyons Morris (1975)

References

Fitz-Gibbon, C. T., & Morris, L. L. (1975). Theory-based evaluation. Evaluation Comment, 5(1), 1–4. Reprinted in Fitz-Gibbon, C. T., & Morris, L. L. (1996). Theory-based evaluation. Evaluation Practice, 17(2), 177–184.

Worthen, B. R. (1996). Editor’s Note: The Origins of Theory-Based Evaluation. Evaluation Practice, 17(2), 169–171. This comment traces the path back to Fitz-Gibbon and Morris (1975).

Terminology of programme theory in evaluation

This tickles me (Funnell & Rogers, 2011, pp. 23–24):

Over the years, many different terms have been used to describe the approach to evaluation that is based on a “plausible and sensible model of how the program is supposed to work” (Bickman, 1987b):

      • Chains of reasoning (Torvatn, 1999)
      • Causal chain (Hall and O’Day, 1971)
      • Causal map (Montibeller and Belton, 2006)
      • Impact pathway (Douthwaite et al., 2003)
      • Intervention framework (Ministry of Health, NZ 2002)
      • Intervention logic (Nagarajan and Vanheukelen, 1997)
      • Intervention theory (Argyris, 1970; Fishbein et al., 2001)
      • Logic model (Rogers, 2004)
      • Logical framework (logframe) (Practical Concepts, 1979)
      • Mental model (Senge, 1990)
      • Outcomes hierarchy (Lenne and Cleland, 1987; Funnell, 1990, 1997)
      • Outcomes line
      • Performance framework (Montague, 1998; McDonald and Teather, 1997)
      • Program logic (Lenne and Cleland, 1987; Funnell, 1990, 1997)
      • Program theory (Bickman, 1990)
      • Program theory-driven evaluation science (Donaldson, 2005)
      • Reasoning map
      • Results chain
      • Theory of action (Patton, 1997; Schorr, 1997)
      • Theory of change (Weiss, 1998)
      • Theory-based evaluation (Weiss, 1972; Fitz-Gibbon and Morris, 1975)
      • Theory-driven evaluation (Chen and Rossi, 1983)

References

Funnell, S. C., & Rogers, P. J. (2011). Purposeful Program Theory: Effective Use of Theories of Change and Logic Models. Jossey-Bass.

“Path analysis is conceptually compatible with TBE”

“Analysis of the sequences of data envisioned in TBE [theory-based evaluation] presents many challenges. The basic task is to see how well the evidence matches the theories that were posited. Path analysis is conceptually compatible with TBE and has been used by evaluators (Murray and Smith 1979; Smith 1990), but the recurrent problem is that important variables may be overlooked, the model is incomplete, and hence the results can be misleading. Structural equation modeling through LISREL techniques holds much promise, but it has been used only on a limited scale in evaluation.”

– Weiss, C. H. (1997, p. 512). How can theory-based evaluation make greater headway? Evaluation Review, 21(4), 501–524.

Carol Fitz-Gibbon (1938–2017), author of the first description of theory-based evaluation, on the importance of RCTs

“[…] I produced the first description of theory based evaluation […]. The point of theory based evaluation is to see, firstly, to what extent the theory is being implemented and, secondly, if the predicted outcomes then follow. It is particularly useful as an interim measure of implementation when the outcomes cannot be measured until much later. But most (if not all) theories in social science are only sets of persuasively stated hypotheses that provide a temporary source of guidance. In order to see if the hypotheses can become theories one must measure the extent to which the predicted outcomes are achieved. This requires randomised controlled trials. Even then the important point is to establish the direction and magnitude of the causal relation, not the theory. Many theories can often fit the same data.”

Fitz-Gibbon, C. T. (2002). Researching outcomes of educational interventions. BMJ, 324(7346), 1155.

It’s all theory-based and counterfactual

Two of my favourite articles on evaluation are Cook’s (2000) argument that all impact evaluations, RCTs included, are theory-based and Reichardt’s (2022) argument that there’s always a counterfactual, if not explicitly articulated then not far beneath the surface. I think both arguments are irrefutable, but how we can build on theirs and others’ work to improve evaluation commissioning and delivery seems a formidable challenge given the fiercely defended dichotomies in the field.

If all impact evaluation really is theory-based then it’s clear there’s huge variation in the quality of theories and theorising. If all impact evaluation depends on counterfactuals then there is huge variation in how compelling the evidence is for the counterfactual outcomes, particularly when there is no obvious comparison group.

Clarifying these kinds of distinctions is, I think, important for improving evaluations and the public services and other programmes they evaluate.

References

Cook, T. D. (2000). The false choice between theory-based evaluation and experimentation. In A. Petrosino, P. J. Rogers, T. A. Huebner, & T. A. Hacsi (Eds.), New directions in evaluation: Program Theory in Evaluation: Challenges and Opportunities (pp. 27–34). Jossey-Bass.

Reichardt, C. S. (2022). The Counterfactual Definition of a Program Effect. American Journal of Evaluation, 43(2), 158–174.