Hedges’ g for multilevel models in R {lmeInfo}

This package looks useful (for {nlme}, not {lme4}).

“Provides analytic derivatives and information matrices for fitted linear mixed effects (lme) models and generalized least squares (gls) models estimated using lme() (from package ‘nlme’) and gls() (from package ‘nlme’), respectively. The package includes functions for estimating the sampling variance-covariance of variance component parameters using the inverse Fisher information. The variance components include the parameters of the random effects structure (for lme models), the variance structure, and the correlation structure. The expected and average forms of the Fisher information matrix are used in the calculations, and models estimated by full maximum likelihood or restricted maximum likelihood are supported. The package also includes a function for estimating standardized mean difference effect sizes (Pustejovsky, Hedges, and Shadish (2014) <doi:10.3102/1076998614547577>) based on fitted lme or gls models.”

Why does everyone love a good RCT?

The individual treatment effect is defined as an individual’s potential outcome under treatment minus their potential outcome under control. This within-participant difference cannot be directly measured since only one of the two potential outcomes is realised depending on whether the participant was exposed to treatment or control.

Everyone loves a good randomised controlled trial because the mean outcome of people who were exposed to treatment minus the mean outcome of people who were exposed to control – a between-participant difference – is an unbiased estimator of the mean of within-participant individual treatment effects.

I’ve coded up a simulation in R over here to illustrate how they work. Note in particular the importance of confidence intervals!
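The logic of the previous paragraph can also be sketched in a few lines of simulation (here in Python rather than R; the treatment effect of 0.5, the sample size, and the number of re-randomisations are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Each participant has two potential outcomes; the individual treatment
# effect (ITE) is their difference, which can never be directly measured.
y0 = rng.normal(0, 1, n)          # potential outcome under control
y1 = y0 + rng.normal(0.5, 1, n)   # potential outcome under treatment
ite = y1 - y0                     # within-participant treatment effects

estimates = []
for _ in range(2000):
    z = rng.permutation(np.repeat([0, 1], n // 2))  # randomise half to treatment
    y_obs = np.where(z == 1, y1, y0)                # only one outcome is realised
    estimates.append(y_obs[z == 1].mean() - y_obs[z == 0].mean())

# The between-participant difference in means is unbiased for the mean
# within-participant effect: averaged over randomisations, the two agree.
print(round(float(ite.mean()), 2), round(float(np.mean(estimates)), 2))
```

Any single randomisation gives a noisy estimate (hence the importance of confidence intervals), but the estimator is unbiased across randomisations.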

On the term “randomista”

Sophie Webber and Carolyn Prouse (2018, p. 169, footnote 1) write:

“Randomistas is a slang term used by critics to describe proponents of the RCT methodology. It is almost certainly a gendered, derogatory term intended to flippantly dismiss experimental economists and their success, particularly Esther Duflo, one of the most successful experts on randomization.” [Emphasis original.]


Sophie Webber and Carolyn Prouse (2018). The New Gold Standard: The Rise of Randomized Control Trials and Experimental Development. Economic Geography, 94(2), 166–187.

Degtiar & Rose (2023) – A Review of Generalizability and Transportability

“This article presents a framework for addressing external validity bias, including a synthesis of approaches for generalizability and transportability, and the assumptions they require, as well as tests for the heterogeneity of treatment effects and differences between study and target populations.”


Degtiar, I., & Rose, S. (2023). A Review of Generalizability and Transportability. Annual Review of Statistics and Its Application, 10(1), 501–524.

What works for whom and in what contexts

“It is sometimes argued that we need rich qualitative data in order to find out not ‘what works’ but for whom and in what contexts. Anyone familiar with the design of experiments will agree. That question is answered by factorial designs with interaction terms… although main effects predominate in most educational datasets. If that last sentence seems like an unfamiliar concept, blame whoever taught you research methods and please make sure that your students are familiar with procedures that will underpin the thousands of controlled trials that education needs if it is to know that rather than simply guessing or asserting why.”

– Carol Taylor Fitz-Gibbon (2002). Knowing Why and Knowing That. Paper presented to the European Evaluation Conference in Seville, Spain. Emphasis and ellipsis in original.

Inside every matching study

A potentially useful one-sentence(!) intervention for making a case to run a statistical matching evaluation rather than a randomised controlled trial:

“Matching can be thought of as a technique for finding approximately ideal experimental data hidden within an observational data set.”

– King, G., & Nielsen, R. (2019, p. 442) [Why Propensity Scores Should Not Be Used for Matching. Political Analysis, 27(4), 435–454]


Carol Fitz-Gibbon (1938–2017), author of the first description of theory-based evaluation, on the importance of RCTs

“[…] I produced the first description of theory based evaluation […]. The point of theory based evaluation is to see, firstly, to what extent the theory is being implemented and, secondly, if the predicted outcomes then follow. It is particularly useful as an interim measure of implementation when the outcomes cannot be measured until much later. But most (if not all) theories in social science are only sets of persuasively stated hypotheses that provide a temporary source of guidance. In order to see if the hypotheses can become theories one must measure the extent to which the predicted outcomes are achieved. This requires randomised controlled trials. Even then the important point is to establish the direction and magnitude of the causal relation, not the theory. Many theories can often fit the same data.”

Fitz-Gibbon, C. T. (2002). Researching outcomes of educational interventions. BMJ, 324(7346), 1155.

Beautiful friendships have been jeopardised

This is an amusing opening to a paper on face validity, by Mosier (1947):

“Face validity is a term that is bandied about in the field of test construction until it seems about to become a part of accepted terminology. The frequency of its use and the emotional reaction which it arouses – ranging almost from contempt to highest approbation – make it desirable to examine its meaning more closely. When a single term variously conveys high praise or strong condemnation, one suspects either ambiguity of meaning or contradictory postulates among those using the term. The tendency has been, I believe, to assume unaccepted premises rather than ambiguity, and beautiful friendships have been jeopardized when a chance remark about face validity has classed the speaker among the infidels.”

I think dozens of beautiful friendships have been jeopardized by loose talk about randomised controlled trials, theory-based evaluation, realism, and positivism, among many others. I’ve just seen yet another piece arguing that you wouldn’t evaluate a parachute with an RCT and I can’t even.


Mosier, C. I. (1947). A Critical Examination of the Concepts of Face Validity. Educational and Psychological Measurement, 7(2), 191–205.

Applying process tracing to RCTs

Process tracing is an application of Bayes’ theorem to test hypotheses using qualitative evidence.¹ Application areas tend to be complex, e.g., evaluating the outcomes of international aid or determining the causes of a war by interpreting testimony and documents. This post explores what happens if we apply process tracing to a simple hypothetical quantitative study: an RCT that includes a mediation analysis.

Process tracing is often conducted without probabilities, using heuristics such as the “hoop test” or “smoking gun test” that make its Bayesian foundations digestible. Alternatively, the probabilities may be softened by expressing them through verbal descriptors such as those provided by the PHIA Probability Yardstick. Given the simple example we will tackle, I will apply Bayes’ rule directly to point probabilities.

I will assume that there are three mutually exclusive hypotheses:

Null: the intervention has no effect.

Out: the intervention improves outcomes; however, not through the hypothesised mediator (it works but we have no idea how).

Med: the intervention improves the outcome and it does so through the hypothesised mediator.

Other hypotheses I might have included are that the intervention causes harm or that the mediator operates in the opposite direction to that hypothesised. We might also be interested in whether the intervention pushes the mediator in the desired direction without shifting the outcome. But let’s not overcomplicate things.

There are two sources of evidence, estimates of:

Average treatment effect (ATE): I will treat this evidence source as binary: whether or not there is a statistically significant difference between treatment and control (alternative versus null hypothesis). Let’s suppose that the Type I error rate is 5% and power is 80%. This means that if either Out or Med holds, then there is an 80% chance of obtaining a statistically significant effect. If neither holds, then there is a 5% chance of obtaining a statistically significant effect (in error).

Average causal mediation effect (ACME): I will again treat this as binary: whether or not the ACME is statistically significantly different from zero (alternative versus null hypothesis). I will assume that if the ATE is significant and Med holds, then there is a 70% chance that the ACME will be significant. Otherwise, I will assume a 5% chance (by Type I error).

Note where I obtained the probabilities above. I got the 5% and 80% for free, following conventions for Type I error and power in the social sciences. I arrived at the 70% using finger-in-the-wind: it should be possible to choose a decent mediator based on the prior literature, I reasoned; however, I have seen examples where a reasonable choice of mediator still fails to operate as expected in a highly powered study.

Finally, I need to choose prior probabilities for Null, Out, and Med. Under clinical equipoise, I feel that there should be a 50-50 chance of the intervention having an effect or not (findings from prior studies of the same intervention notwithstanding). Now suppose it does have an effect. I am going to assume there is a 50% chance of that effect operating through the mediator.

This means that

P(Null) = 50%
P(Out) = 25%
P(Med) = 25%

So, P(Out or Med) = 50%, i.e., the prior probabilities are set up to reflect my belief that there is a 50% chance the intervention works somehow.

I’m going to use a Bayesian network to do the sums for me (I used GeNIe Modeler). Here’s the setup:

The left-hand node shows the prior probabilities, as chosen. The right-hand nodes show the inferred probabilities of observing the different patterns of evidence.

Let’s now pretend we have concluded the study and observed evidence. Firstly, we are delighted to discover that there is a statistically significant effect of the intervention on outcomes. Let’s update our Bayesian network (note how the Alternative outcome on ATE has been underlined and emboldened):

P(Null) has now dropped to 6% and P(ACME > 0) has risen to 36%. We do not yet have sufficient evidence to distinguish between Out and Med: their probabilities are both 47%.²

Next, let’s run the mediation analysis. It is also statistically significant:

So, given our initial probability assignments and the pretend evidence observed, we can be 93% sure that the intervention works and does so through the mediator.

If the mediation test had not been statistically significant, then P(Out) would have risen to 69% and P(Med) would have dropped to 22%. If the ATE had been indistinguishable from zero, then P(Null) would have been 83%.
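For readers who would rather check the arithmetic than install Bayesian-network software, the model is small enough that the posteriors fall straight out of Bayes’ rule. A minimal sketch (Python rather than GeNIe Modeler, using the probability assignments above):

```python
# Prior probabilities over the three mutually exclusive hypotheses.
priors = {"Null": 0.50, "Out": 0.25, "Med": 0.25}

# P(significant ATE | hypothesis): Type I error rate 5%, power 80%.
p_ate_sig = {"Null": 0.05, "Out": 0.80, "Med": 0.80}

# P(significant ACME | significant ATE, hypothesis).
p_acme_sig = {"Null": 0.05, "Out": 0.05, "Med": 0.70}

def posterior(likelihoods, prior):
    """Bayes' rule: normalise prior x likelihood over the hypotheses."""
    joint = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}

# Evidence 1: statistically significant ATE.
after_ate = posterior(p_ate_sig, priors)
print({h: round(p, 2) for h, p in after_ate.items()})
# Null ~ 0.06, Out ~ 0.47, Med ~ 0.47

# Evidence 2: statistically significant ACME as well.
after_both = posterior(p_acme_sig, after_ate)
print({h: round(p, 2) for h, p in after_both.items()})
# Med ~ 0.93

# Counterfactual: ATE significant but ACME not.
p_acme_ns = {h: 1 - p for h, p in p_acme_sig.items()}
after_ate_no_acme = posterior(p_acme_ns, after_ate)
# Out ~ 0.69, Med ~ 0.22
```

This reproduces the network’s numbers exactly, which is reassuring: the Bayesian network is doing nothing more exotic than sequential applications of Bayes’ rule.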

Is this process tracing or simply putting Bayes’ rule to work as usual? Does this example show that RCTs can be theory-based evaluations, since process tracing is a theory-based method, or does the inclusion of a control group rule out that possibility, as Figure 3.1 of the Magenta Book would suggest? I will leave the reader to assign probabilities to each possible conclusion. Let me know what you think.

¹ Okay, I accept that it is controversial to say that process tracing is necessarily an application of Bayes, particularly when no sums are involved. However, to me Bayes’ rule explains in the simplest possible terms why the four tests attributed to Van Evera (1997) [Guide to Methods for Students of Political Science. New York, NY: Cornell University Press.] work. It’s clear why there are so many references to Bayes in the process tracing literature.

² These are all actually conditional probabilities; I have left the conditioning implicit in the notation for ease of reading. Hopefully all is clear from the prose.

For example, P(Hyp = Med | ATE = Alternative) = 47%; in other words, the probability of Med given a statistically significant ATE estimate is 47%.

Baseline balance in experiments and quasi-experiments

Baseline balance is important for both experiments and quasi-experiments, just not in the way researchers sometimes believe. Here are excerpts from three of my favourite discussions of the topic.

Don’t test for baseline imbalance in RCTs. Senn (1994, p. 1716):

“… the following are two incontrovertible facts about a randomized clinical trial:

1. over all randomizations the groups are balanced;

2. for a particular randomization they are unbalanced.

Now, no ‘[statistically] significant imbalance’ can cause 1 to be untrue and no lack of a significant balance can make 2 untrue. Therefore the only reason to employ such a test must be to examine the process of randomization itself. Thus a significant result should lead to the decision that the treatment groups have not been randomized…”
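Senn’s point is easy to demonstrate by simulation: under correct randomisation, a baseline significance test rejects at roughly its nominal rate by construction, so a “significant imbalance” can only cast doubt on the randomisation process itself. A minimal sketch (Python, using a z-approximation to the two-sample t-test; the covariate’s distribution is an arbitrary choice):

```python
import random
import statistics
from math import erf, sqrt

def z_test_p(a, b):
    """Two-sided p-value from a z-approximation to the two-sample t-test."""
    z = (statistics.fmean(a) - statistics.fmean(b)) / sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

random.seed(1)
n, reps, alpha = 100, 2000, 0.05

rejections = 0
for _ in range(reps):
    baseline = [random.gauss(50, 10) for _ in range(n)]  # pre-treatment covariate
    groups = [0] * (n // 2) + [1] * (n // 2)
    random.shuffle(groups)  # the randomisation step
    treat = [x for x, g in zip(baseline, groups) if g == 1]
    control = [x for x, g in zip(baseline, groups) if g == 0]
    rejections += z_test_p(treat, control) < alpha

# Close to alpha: over all randomisations the groups are balanced, yet a
# fraction of individual randomisations will "fail" the test by design.
print(rejections / reps)
```

The covariate plays no causal role here, so every rejection is a false alarm; the test tells you about the test, not about the trial.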

Do examine baseline imbalance in quasi-experiments; however, not by using statistical tests. Sample descriptives, such as a difference in means, suffice. Imai et al. (2008, p. 497):

“… from a theoretical perspective, balance is a characteristic of the sample, not some hypothetical population, and so, strictly speaking, hypothesis tests are irrelevant…”

Using p-values from t-tests and the like can lead to erroneous conclusions about balance. As you prune a dataset to improve balance, the power of such tests to detect imbalance decreases. Imai et al. (2008, p. 497 again):

“Since the values of […] hypothesis tests are affected by factors other than balance, they cannot even be counted on to be monotone functions of balance. The t-test can indicate that balance is becoming better whereas the actual balance is growing worse, staying the same or improving. Although we choose the most commonly used t-test for illustration, the same problem applies to many other test statistics…”
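The non-monotonicity is easy to see from the test statistic alone: the p-value depends on the sample size as well as on the imbalance, so pruning cases can make a balance test “pass” even when the imbalance itself has not moved. A sketch (Python, z-approximation to the two-sample t-test, unit-variance groups assumed):

```python
from math import erf, sqrt

def two_sided_p(d, n_per_group):
    # z-approximation to the two-sample t-test p-value for a standardized
    # mean difference d, with n_per_group cases per arm and unit variances.
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

d = 0.2  # identical standardized imbalance in both scenarios
print(round(two_sided_p(d, 500), 4))  # ~0.0016: the test cries "imbalance"
print(round(two_sided_p(d, 50), 2))   # ~0.32: the same imbalance now looks "fine"
```

The standardized difference is 0.2 in both cases; only the sample size changed. That is Imai et al.’s point: compare sample descriptives such as the standardized difference itself, not p-values.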

If your matching has led to baseline balance, then you’re good, even if the matching model is misspecified. (Though not if you’re missing key covariates, of course.) Rosenbaum (2023, p. 29):

“So far as matching and stratification are concerned, the propensity score and other methods are a means to an end, not an end in themselves. If matching for a misspecified and misestimated propensity score balances x, then that is fine. If by bad luck, the true propensity score failed to balance x, then the match is inadequate and should be improved.”


Imai, K., King, G., & Stuart, E. A. (2008). Misunderstandings between experimentalists and observationalists about causal inference. Journal of the Royal Statistical Society: Series A (Statistics in Society), 171(2), 481–502.

Rosenbaum, P. R. (2023). Propensity score. In J. R. Zubizarreta, E. A. Stuart, D. S. Small, & P. R. Rosenbaum, Handbook of Matching and Weighting Adjustments for Causal Inference (pp. 21–38). Chapman and Hall/CRC.

Senn, S. (1994). Testing for baseline balance in clinical trials. Statistics in Medicine, 13, 1715–1726.