Intervening mechanism evaluation

‘The intervening mechanism evaluation approach assesses whether the causal assumptions underlying a program are functioning as stakeholders had projected (Chen, 1990). […] It is not always labeled in the same way by those who apply it. Some evaluators have referred to it as “theory of change evaluation” (Connell, Kubisch, Schorr, & Weiss, 1995) or “theory-based evaluation” (Rogers, Hasci, Petrosino, & Huebner, 2000; Weiss, 1997).’

Chen, H. T. (2015, p. 312). Practical Program Evaluation: Theory-Driven Evaluation and the Integrated Evaluation Perspective. SAGE Publications Ltd.

James Lind (1753), A treatise of the scurvy – key excerpts

Excerpts from Lind (1753), with help deciphering the ye olde English from others who have quoted him (Hughes, 1975; Bartholomew, 2002; Weber & De Vreese, 2005).

Lind’s study is sometimes presented as an RCT, but it’s not clear how his patients were assigned to groups, just that the cases “were as similar as I could have them” (see discussion in Weber & De Vreese, 2005). Bartholomew (2002) argues that Lind was convinced scurvy was a disease of the digestive system and warns against quoting the positive outcomes for oranges and lemons (and cider) out of the broader context of Lind’s other work.

Here’s what Lind said he did:

“On the 20th May, 1747, I took twelve patients in the scurvy on board the Salisbury at sea. Their cases were as similar as I could have them. They all in general had putrid gums, the spots and lassitude, with weakness of their knees. They lay together in one place, being a proper apartment for the sick in the fore-hold; and had one diet in common to all, viz., water gruel sweetened with sugar in the morning; fresh mutton broth often times for dinner; at other times puddings, boiled biscuit with sugar etc.; and for supper barley, raisins, rice and currants, sago and wine, or the like.”

Groups (n = 2 in each):

  • “ordered each a quart of cyder a day”
  • “twenty five gutts of elixir vitriol three times a day upon an empty stomach, using a gargle strongly acidulated with it for their mouths.”
  • “two spoonfuls of vinegar three times a day upon an empty stomach”
  • “a course of sea water”
  • “two oranges and one lemon given them every day. These they eat with greediness”
  • “The two remaining patients took the bigness of a nutmeg three times a day of an electuary recommended by an hospital surgeon made of garlic, mustard seed, rad. raphan., balsam of Peru and gum myrrh, using for common drink barley water well acidulated with tamarinds, by a decoction of which, with the addition of cremor tartar, they were gently purged three or four times during the course”

Excerpt from the study outcomes:

  • “The consequence was that the most sudden and visible good effects were perceived from the use of the oranges and lemons; one of those who had taken them, being at the end of six days fit for duty”
  • “Next to the oranges, I thought the cyder had the best effects”

References

Bartholomew, M. (2002). James Lind’s Treatise of the Scurvy (1753). Postgraduate Medical Journal, 78, 695–696.

Hughes, R. E. (1975). James Lind and the cure of Scurvy: An experimental approach. Medical History, 19(4), 342–351.

Weber, E., & De Vreese, L. (2005). The causes and cures of scurvy. How modern was James Lind’s methodology? Logic and Logical Philosophy, 14(1), 55–67.

The Bletchley Declaration

Worth a read. One key para:

“We affirm that, whilst safety must be considered across the AI lifecycle, actors developing frontier AI capabilities, in particular those AI systems which are unusually powerful and potentially harmful, have a particularly strong responsibility for ensuring the safety of these AI systems, including through systems for safety testing, through evaluations, and by other appropriate measures.”

Terminology of programme theory in evaluation

This tickles me (Funnell & Rogers, 2011, pp. 23-24):

Over the years, many different terms have been used to describe the approach to evaluation that is based on a “plausible and sensible model of how the program is supposed to work” (Bickman, 1987b):

      • Chains of reasoning (Torvatn, 1999)
      • Causal chain (Hall and O’Day, 1971)
      • Causal map (Montibeller and Belton, 2006)
      • Impact pathway (Douthwaite et al., 2003)
      • Intervention framework (Ministry of Health, NZ 2002)
      • Intervention logic (Nagarajan and Vanheukelen, 1997)
      • Intervention theory (Argyris, 1970; Fishbein et al., 2001)
      • Logic model (Rogers, 2004)
      • Logical framework (logframe) (Practical Concepts, 1979)
      • Mental model (Senge, 1990)
      • Outcomes hierarchy (Lenne and Cleland, 1987; Funnell, 1990, 1997)
      • Outcomes line
      • Performance framework (Montague, 1998; McDonald and Teather, 1997)
      • Program logic (Lenne and Cleland, 1987; Funnell, 1990, 1997)
      • Program theory (Bickman, 1990)
      • Program theory-driven evaluation science (Donaldson, 2005)
      • Reasoning map
      • Results chain
      • Theory of action (Patton, 1997; Schorr, 1997)
      • Theory of change (Weiss, 1998)
      • Theory-based evaluation (Weiss, 1972; Fitz-Gibbon and Morris, 1975)
      • Theory-driven evaluation (Chen and Rossi, 1983)

References

Funnell, S. C., & Rogers, P. J. (2011). Purposeful Program Theory: Effective Use of Theories of Change and Logic Models. Jossey-Bass.

From metaphysics to goals in social research and evaluation

Some of the social research and evaluation papers I encounter include declarations of the authors’ metaphysical stance: social constructionist, realist (critical or otherwise), phenomenologist – and sometimes a dig at positivism. This is one way research and researchers are classified. Clearly there are different kinds of research; however, might it be easiest to see the differences in terms of research goals rather than jargon-heavy isms? Here are three examples of goals, to explore what I mean.

Evoke empathy. If you can’t have a chat with someone, then the next best way to empathise with them is via a rich description by or about them. There is a bucket-load of pretentiousness in the literature (search for “thick description” to find some), but skip over this and there are wonderful works that are simply stories: biographies that make you long to meet the subject; film documentaries, though they don’t fit easily into traditional research outputs; anthologies gathering expressions of people’s lived experience without a researcher filter. “Interpretative Phenomenological Analyses” manage to include stories too, though with more metaphysics.

Classify. This may be the classification of perspectives, attitudes, experiences, processes, organisations, or other stuff-that-happens in society. For example: social class, personality, experiences people have in psychological therapy, political orientation, emotional experiences. The goal here is to identify patterns, whether from thematic analysis of interview responses, latent class analysis of answers on Likert scales, or some other kind of data and analysis. There’s no escaping theory – articulated and debated, or unarticulated and unchallenged – when doing this.

Predict. Do people occupying a particular social class location tend to experience some mental health difficulties more often than others? Does your personality predict the kinds of books you like to read? Do particular events predict an emotion you will feel? Other predictions concern the impact of interventions of various kinds (broadly construed). What would happen if you funded national access to cognitive behavioural therapy or a universal basic income? Theory matters here too, usually involving a story or model of why variables relate to each other. Prediction can be statistical or may involve gathering expert views (expert by lived experience or by profession).

These goals cannot be straightforwardly mapped onto quantitative and qualitative data and analysis. As a colleague and I wrote (Fugard & Potts, 2016):

“Some qualitative research develops what looks like a taxonomy of experiences or phenomena. Much of this isn’t even framed as qualitative. Take for example Gray’s highly-cited work classifying type 1 and type 2 synapses. His labelled photos of cortex slices illustrate beautifully the role of subjectivity in qualitative analysis and there are clear questions about generalisability. Some qualitative analyses use statistical models of quantitative data, for example latent class analyses showing the different patterns of change in psychological therapies.”

What I personally want to see, as an avid reader of research, is a summary of the theory – topic-specific, substantive theory rather than metaphysical – that researchers had before launching into gathering data; how they planned to analyse the data; and what they think about the theory now that they have finished. Ideally I also want to know something about the politics driving the research, whether expressed in terms of conflicts of interest or the authors’ position on the inequity or oppression investigated in a study. Reflections on ontological realism and epistemic relativity – less so.

Core elements in theory-driven evaluation

Huey Chen (1990) solved many issues that are still endlessly discussed in evaluation, e.g., the role of stakeholder theories versus social science theories and the different ways theories can be tested. Here’s a useful summary of core elements of a theory-driven approach (Coryn et al., 2011, Table 1, p. 205):

1. Theory-driven evaluations/evaluators should formulate a plausible program theory
   a. Formulate program theory from existing theory and research (e.g., social science theory)
   b. Formulate program theory from implicit theory (e.g., stakeholder theory)
   c. Formulate program theory from observation of the program in operation/exploratory research (e.g., emergent theory)
   d. Formulate program theory from a combination of any of the above (i.e., mixed/integrated theory)

2. Theory-driven evaluations/evaluators should formulate and prioritize evaluation questions around a program theory
   a. Formulate evaluation questions around program theory
   b. Prioritize evaluation questions

3. Program theory should be used to guide planning, design, and execution of the evaluation under consideration of relevant contingencies
   a. Design, plan, and conduct evaluation around a plausible program theory
   b. Design, plan, and conduct evaluation considering relevant contingencies (e.g., time, budget, and use)
   c. Determine whether evaluation is to be tailored (i.e., only part of the program theory) or comprehensive

4. Theory-driven evaluations/evaluators should measure constructs postulated in program theory
   a. Measure process constructs postulated in program theory
   b. Measure outcome constructs postulated in program theory
   c. Measure contextual constructs postulated in program theory

5. Theory-driven evaluations/evaluators should identify breakdowns, side effects, determine program effectiveness (or efficacy), and explain cause-and-effect associations between theoretical constructs
   a. Identify breakdowns, if they exist (e.g., poor implementation, unsuitable context, and theory failure)
   b. Identify anticipated (and unanticipated), unintended outcomes (both positive and negative) not postulated by program theory
   c. Describe cause-and-effect associations between theoretical constructs (i.e., causal description)
   d. Explain cause-and-effect associations between theoretical constructs (i.e., causal explanation)
      i. Explain differences in direction and/or strength of relationship between program and outcomes attributable to moderating factors/variables
      ii. Explain the extent to which one construct (e.g., intermediate outcome) accounts for/mediates the relationship between other constructs

References

Chen, H. T. (1990). Theory-driven evaluations. Thousand Oaks, CA: Sage.

Coryn, C. L. S., Noakes, L. A., Westine, C. D., & Schröter, D. C. (2011). A systematic review of theory-driven evaluation practice from 1990 to 2009. American Journal of Evaluation, 32(2), 199–226. https://doi.org/10.1177/1098214010389321

“Path analysis is conceptually compatible with TBE”

“Analysis of the sequences of data envisioned in TBE [theory-based evaluation] presents many challenges. The basic task is to see how well the evidence matches the theories that were posited. Path analysis is conceptually compatible with TBE and has been used by evaluators (Murray and Smith 1979; Smith 1990), but the recurrent problem is that important variables may be overlooked, the model is incomplete, and hence the results can be misleading. Structural equation modeling through LISREL techniques holds much promise, but it has been used only on a limited scale in evaluation.”

– Weiss, C. H. (1997, p. 512). How can theory-based evaluation make greater headway? Evaluation Review, 21(4), 501–524.
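
Weiss’s point is easy to try out nowadays. Here is a minimal, hedged sketch in R of the kind of path analysis she describes – my own illustration, not code from Weiss – fitting a simple programme theory (programme → intermediate outcome → final outcome) to simulated data using the lavaan package. All variable names and coefficients are invented for the example.

    # A path model mirroring a simple programme theory.
    # (Illustrative only: simulated data, invented names.)
    library(lavaan)

    set.seed(1)
    n <- 200
    programme    <- rbinom(n, 1, 0.5)            # exposure to the programme
    intermediate <- 0.5 * programme + rnorm(n)   # postulated mechanism
    outcome      <- 0.4 * intermediate + rnorm(n)
    dat <- data.frame(programme, intermediate, outcome)

    model <- "
      intermediate ~ a * programme
      outcome      ~ b * intermediate + c * programme
      indirect := a * b    # theory-consistent (mediated) path
      total    := c + a * b
    "
    fit <- sem(model, data = dat)
    summary(fit)   # do the estimated paths match the posited theory?

Weiss’s caveat still bites, though: if the model omits an important variable, the estimates – and the apparent fit to the programme theory – can mislead.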

Counterfactual analysis as fatalism

‘Many counterfactual analyses are based, explicitly or implicitly, on an attitude that I term fatalism. This considers the various potential responses \(Y_{i}(u)\), when treatment \(i\) is applied to unit \(u\), as predetermined attributes of unit \(u\), waiting only to be uncovered by suitable experimentation. (It is implicit that the unit \(u\) and its properties and propensities exist independently of, and are unaffected by, any treatment that may be applied.) Note that because each unit label \(u\) is regarded as individual and unrepeatable, there is never any possibility of empirically testing this assumption of fatalism, which thus can be categorized as metaphysical.’

– Dawid, A. P. (2000, pp. 412-413) [Causal inference without counterfactuals. Journal of the American Statistical Association, 95, 407–424].

Counterfactual talk as nonsense

‘We all indulge, in anger and regret, in counterfactual talk: “If they had not operated, John would be alive today”; “If I had not said that, she would not have left me”; “If I had chosen a different publisher, my book on causality without counterfactuals would have sold 10,000 copies.” The more fortunate among us have someone to remind us that we are talking nonsense.’

– Shafer, G. (2000, p. 442) [Causal inference without counterfactuals: Comment. Journal of the American Statistical Association, 95, 438–442].


Time for counterfactuals

I have just discovered Scriven’s stimulating (if grim) challenge to a counterfactual understanding of causation (see the debate recorded in Cook et al., 2010, p. 108):

“The classic example of this is the guy who has jumped off the top of a skyscraper and as he passes the 44th floor somebody shoots him through the head with a .357 magnum. Well, it’s clear enough that the shooter killed him but it’s clearly not true that he would not have died if the shooter hadn’t shot him; so the counterfactual condition does not apply, so it can’t be an essential part of the meaning of cause.”

I love this example because it illustrates a common form of programme effect and summarises the human condition – all in a couple of sentences! Let’s reshape it into an analogous example that extends the timeline by a couple of decades:

“A 60-year-old guy chooses not to get a Covid vaccine. A few months later, he gets Covid and dies. Average male life expectancy is about 80 years.”

(I guess jumping is analogous to being born!)

By the end of the first sentence, I reason that if he had got the vaccine, he probably wouldn’t have died. By the end of the second sentence, I am reminded of the finiteness of life. So, the vaccine wouldn’t have prevented death – just as the absence of the gunshot wouldn’t have prevented it in the skyscraper example. How can we think about this using counterfactuals?

In a programme evaluation, it is common to gather data at a series of fixed time points, for instance a few weeks, months, and, if you are lucky, years after baseline. We are often happy to see improvement even if it doesn’t endure. For instance, if I take a painkiller, I don’t expect its effects to persist forevermore. If a vaccine extends life by two decades, that’s rather helpful. Programme effects are defined at each time point.

To make sense of the original example, we need to add in time. There are three key timepoints:

  1. Jumping (T0).
  2. Mid-flight after the gunshot (T1).
  3. Hitting the ground (T2).

When considering counterfactuals, the world may be different at each of these times, e.g., at T0 the main character might have decided to take the lift.

Here are counterfactuals that make time explicit:

  • If the guy hadn’t jumped at T0, then he wouldn’t have hit the ground at T2.
  • If the guy hadn’t jumped at T0, then he wouldn’t have been shot with the magnum and killed at T1.
  • If the guy had jumped, but hadn’t been shot by the magnum, he would still have been alive at T1 but not at T2.

To assign truth values or probabilities to each of these requires a model of some description, e.g., a causal Bayesian network, which formalises your understanding of the intentions and actions of the characters in the text – something like the small model sketched below, with conditional probabilities filled in appropriately.

So, for instance, the probability of being dead at T2 given jumping at T0 is high – if you haven’t added variables about parachutes. What happens mid-flight governs the T1 outcomes. Alternatively, you could just use informal intuition. Exercise for the reader: give it a go.
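
Here is one way to give it a go: a minimal sketch in base R – my own illustration, not from Cook et al. or Halpern – encoding the story as a deterministic structural causal model. The variable names are invented, and the model bakes in simplifying assumptions (no parachutes, a lethal shot).

    # Exogenous choices: jump (at T0) and shoot (mid-flight).
    # Structural equations for death at T1 and T2:
    dead_t1 <- function(jump, shoot) jump & shoot  # shot while falling => dead at T1
    dead_t2 <- function(jump, shoot) jump          # jumped => dead on impact at T2

    scenario <- function(jump, shoot) {
      c(dead_at_T1 = dead_t1(jump, shoot),
        dead_at_T2 = dead_t2(jump, shoot))
    }

    scenario(jump = TRUE,  shoot = TRUE)   # actual world: dead at T1 and T2
    scenario(jump = FALSE, shoot = FALSE)  # no jump: alive at both times
    scenario(jump = TRUE,  shoot = FALSE)  # jump, no shot: alive at T1, dead at T2

    # But-for check: flipping jump changes death at T1 and T2; flipping
    # shoot (holding jump = TRUE) changes death at T1 only.

The deterministic equations here stand in for the conditional probability tables that a full causal Bayesian network would carry.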

Using the Halpern-Pearl definitions of causality on this model (Halpern, 2016), jumping caused death at both T1 and T2. The shooting caused death at T1 but not T2. (R code here – proper explanation to be completed, but you could try this companion blog post and citation therein.)

Back, then, to the vaccine example. The counterfactuals rewrite to something like:

  • If the guy hadn’t been born at T0, then he wouldn’t have died at T2.
  • If the guy hadn’t been born at T0, then he couldn’t have chosen not to get a vaccine and died at T1.
  • If the guy had been born, but had decided to get the vaccine, he would still have been alive at T1 aged 60, but possibly not at T2 aged 80.

References

Cook, T. D., Scriven, M., Coryn, C. L. S., & Evergreen, S. D. H. (2010). Contemporary Thinking About Causation in Evaluation: A Dialogue With Tom Cook and Michael Scriven. American Journal of Evaluation, 31(1), 105–117.

Halpern, J. Y. (2016). Actual causality. The MIT Press.