Hedges’ g for multilevel models in R {lmeInfo}

This package looks useful (for {nlme}, not {lme4}).

“Provides analytic derivatives and information matrices for fitted linear mixed effects (lme) models and generalized least squares (gls) models estimated using lme() (from package ‘nlme’) and gls() (from package ‘nlme’), respectively. The package includes functions for estimating the sampling variance-covariance of variance component parameters using the inverse Fisher information. The variance components include the parameters of the random effects structure (for lme models), the variance structure, and the correlation structure. The expected and average forms of the Fisher information matrix are used in the calculations, and models estimated by full maximum likelihood or restricted maximum likelihood are supported. The package also includes a function for estimating standardized mean difference effect sizes (Pustejovsky, Hedges, and Shadish (2014) <doi:10.3102/1076998614547577>) based on fitted lme or gls models.”

How Many Imputations Do You Need? {howManyImputations}

“When performing multiple imputations, while 5-10 imputations are sufficient for obtaining point estimates, a larger number of imputations are needed for proper standard error estimates. This package allows you to calculate how many imputations are needed, following the work of von Hippel (2020).”

Useful example here.


Von Hippel, P. T. (2020). How many imputations do you need? A two-stage calculation using a quadratic rule. Sociological Methods & Research, 49(3), 699-718.
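As a rough illustration of the quadratic rule (my reading of von Hippel, 2020, not the {howManyImputations} API), here is a sketch: the number of imputations \(m\) needed for a target coefficient of variation \(cv\) of the standard error estimate, given the fraction of missing information (FMI), is approximately \(m = 1 + \frac{1}{2}(\mathrm{FMI}/cv)^2\).

```python
import math

def imputations_needed(fmi, cv=0.05):
    """Quadratic rule (my reading of von Hippel, 2020): number of
    imputations m such that the coefficient of variation of the SE
    estimate is about `cv`, given the fraction of missing
    information `fmi`."""
    return math.ceil(1 + 0.5 * (fmi / cv) ** 2)

# With 50% missing information and a 5% target CV:
print(imputations_needed(0.5))  # 51
```

In practice the FMI itself has to be estimated from a first, smaller round of imputations, which is the "two-stage" part of von Hippel's procedure; the package automates this from a fitted `mira` object.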


S-values

S-values are a neat idea for helping to think about – maybe even feel – the meaning of p-values. They are described by Rafi and Greenland (2020). This post explains the basic idea, beginning with flips of a coin.

  • Suppose you flip an apparently fair coin and the outcome is heads. How surprised would you feel?
  • Now flip it twice: both outcomes are heads. How surprised would you feel now?
  • Flip the coin 40 times. All of the outcomes are heads. How surprised are you now?

I suspect your level of surprise has gone from something like “meh” through to “surely this isn’t a fair coin?!”

S-values (the S is for surprisal) provide a way to think about p-values in terms of how likely it would be to get a sequence of all heads from a number of flips of a fair coin. That number of flips is the s-value and can be calculated from the p-value.

Here is an example of a coin flipped three times. There are \(2^3 = 8\) possible outcomes, listed in the table below:

First coin flip Second coin flip Third coin flip
H H H
H H T
H T H
H T T
T H H
T H T
T T H
T T T

If the coin is fair, then the probability of each of these outcomes is \(\frac{1}{8}\). In particular, the probability of all heads is also \(\frac{1}{8}\), or 0.125.

More generally, the probability of getting all heads from \(n\) fair coin flips is \(\frac{1}{2^n}\). Here is a table showing some examples:

Flips Probability all heads
1 0.5
2 0.25
3 0.125
4 0.0625
5 0.03125
6 0.01562
7 0.00781
8 0.00391
9 0.00195
10 0.00098

Now here is the connection with p-values. Suppose you run a statistical test and get \(p = 0.03125\); that’s the same probability as that of obtaining five heads in a row from five coin tosses. The s-value is 5.

Or suppose you merely got \(p = 0.5\). That’s the probability of obtaining heads after one flip of the coin. The s-value is 1.

The larger the s-value, the more surprised you would be if the coin were fair.

To convert p-values to s-values, we want to find an \(s\) such that

\(\displaystyle \frac{1}{2^s} = p\).

The log function (base 2) does this for us:

\(\displaystyle s = -\log_2(p)\).

What about the traditional (or notorious?) 0.05 level?

\(-\log_2(0.05) = 4.32\),

to two decimal places. So, that’s the same as getting all heads when you flip a coin 4.32 times – which isn’t entirely intuitive when expressed as coin flips. But you could think of it being a little more surprising than getting four heads in a row if you flipped a fair coin four times.
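The conversion is a one-liner; here is a quick sketch (mine, not from the Rafi and Greenland paper):

```python
import math

def s_value(p):
    # Number of consecutive heads from a fair coin that would be
    # as improbable as the observed p-value: s = -log2(p).
    return -math.log2(p)

print(s_value(0.5))             # 1.0 (one flip)
print(s_value(0.03125))         # 5.0 (five heads in a row)
print(round(s_value(0.05), 2))  # 4.32
```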


Rafi, Z., & Greenland, S. (2020). Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise. BMC Medical Research Methodology, 20, 244.

The interocular traumatic test

“The preceding paragraph illustrates a procedure that statisticians of all schools find important but elusive. It has been called the interocular traumatic test; you know what the data mean when the conclusion hits you between the eyes. The interocular traumatic test is simple, commands general agreement, and is often applicable; well-conducted experiments often come out that way. But the enthusiast’s interocular trauma may be the skeptic’s random error. A little arithmetic to verify the extent of the trauma can yield great peace of mind for little cost.” (Edwards, Lindman, & Savage, 1963, p. 217, who attribute the “test” to J. Berkson, personal comm., 14 July 1958)

Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70(3), 193–242.

“Counterfactual” is not a synonym for “control group”

“Counterfactual” is not a synonym for “control group”. In fact, the treatment group’s actual outcomes are used when estimating the control group’s counterfactual outcomes, which is necessary to estimate the average treatment effect on control (ATC) or average treatment effect (ATE) estimands.

An individual’s treatment effect is defined as a within-person difference between the potential outcome following treatment and the potential outcome following control. This individual treatment effect is impossible to measure, since only one potential outcome is realised depending on which group the individual was in. However, various averages of the treatment effects can be estimated.

For ATC, we are interested in estimating averages of these treatment effects for control group participants. We know the control group’s actual outcomes. We also need to answer the counterfactual query:

If individuals in the control group had been assigned treatment, what would their average outcome have been?

To estimate ATC using matching, we need to find a treatment group match for each individual in the control group. Those treatment group matches are used to estimate the control group’s counterfactual outcomes.

For ATE, we are interested in estimating averages of these treatment effects for all participants. This means we need a combination of answers to the following counterfactual queries:

(a) If individuals in the treatment group had been assigned control, what would their average outcome have been?

(b) If individuals in the control group had been assigned treatment, what would their average outcome have been?

To estimate ATE using matching, each treatment individual needs a control group match and each control group individual needs a treatment group match. So, for ATE, both treatment and control groups could be considered counterfactuals, in the sense that they are both used to estimate the other group’s counterfactual outcomes. However, I think it is clearer if we draw a distinction between group (treatment or control) and what we are trying to estimate using data from a group (actual or counterfactual outcomes).
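A toy sketch of the ATC logic (entirely made-up data; 1:1 nearest-neighbour matching on a single covariate rather than a propensity score):

```python
# Each unit is (covariate x, observed outcome Y). Data are invented.
controls = [
    (1.0, 10.0), (2.0, 12.0), (3.0, 11.0),
]
treated = [
    (1.1, 13.0), (1.9, 15.0), (3.2, 14.0), (5.0, 20.0),
]

effects = []
for x_c, y_c in controls:
    # The treatment-group match supplies this control unit's
    # counterfactual outcome under treatment.
    _, y_cf = min(treated, key=lambda t: abs(t[0] - x_c))
    effects.append(y_cf - y_c)

# ATC: average of (counterfactual treated outcome - actual outcome)
# over the control group only.
atc = sum(effects) / len(effects)
print(atc)  # 3.0
```

Note that the treated unit at x = 5.0 never enters the ATC estimate: treated units are used only as donors of counterfactual outcomes for the controls.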

(If you found this post interesting, please do say hello and let me know!)

Applying process tracing to RCTs

Process tracing is an application of Bayes’ theorem to test hypotheses using qualitative evidence.¹ Application areas tend to be complex, e.g., evaluating the outcomes of international aid or determining the causes of a war by interpreting testimony and documents. This post explores what happens if we apply process tracing to a simple hypothetical quantitative study: an RCT that includes a mediation analysis.

Process tracing is often conducted without probabilities, using heuristics such as the “hoop test” or “smoking gun test” that make its Bayesian foundations digestible. Alternatively, probabilities may be made easier to digest by viewing them through verbal descriptors such as those provided by the PHIA Probability Yardstick. Given the simple example we will tackle, I will apply Bayes’ rule directly to point probabilities.

I will assume that there are three mutually exclusive hypotheses:

Null: the intervention has no effect.

Out: the intervention improves outcomes, but not through the hypothesised mediator (it works but we have no idea how).

Med: the intervention improves the outcome and it does so through the hypothesised mediator.

Other hypotheses I might have included are that the intervention causes harm or that the mediator operates in the opposite direction to that hypothesised. We might also be interested in whether the intervention pushes the mediator in the desired direction without shifting the outcome. But let’s not overcomplicate things.

There are two sources of evidence, estimates of:

Average treatment effect (ATE): I will treat this evidence source as binary: whether there is a statistically significant difference between treatment and control or not (alternative versus null hypothesis). Let’s suppose that the Type I error rate is 5% and power is 80%. This means that if either Out or Med holds, then there is an 80% chance of obtaining a statistically significant effect. If neither holds, then there is a 5% chance of obtaining a statistically significant effect (in error).

Average causal mediation effect (ACME): I will again treat this as binary: is ACME statistically significantly different to zero or not (alternative versus null hypothesis). I will assume that if ATE is significant and Med holds, then there is a 70% chance that ACME will be significant. Otherwise, I will assume a 5% chance (by Type I error).

Note where I obtained the probabilities above. I got the 5% and 80% for free, following conventions for Type I error and power in the social sciences. I arrived at the 70% using finger-in-the-wind: it should be possible to choose a decent mediator based on the prior literature, I reasoned; however, I have seen examples where a reasonable choice of mediator still fails to operate as expected in a highly powered study.

Finally, I need to choose prior probabilities for Null, Out, and Med. Under clinical equipoise, I feel that there should be a 50-50 chance of the intervention having an effect or not (findings from prior studies of the same intervention notwithstanding). Now suppose it does have an effect. I am going to assume there is a 50% chance of that effect operating through the mediator.

This means that

P(Null) = 50%
P(Out) = 25%
P(Med) = 25%

So, P(Out or Med) = 50%, i.e., the prior probabilities are set up to reflect my belief that there is a 50% chance the intervention works somehow.

I’m going to use a Bayesian network to do the sums for me (I used GeNIe Modeler). Here’s the setup:

The left-hand node shows the prior probabilities, as chosen. The right-hand nodes show the inferred probabilities of observing the different patterns of evidence.

Let’s now pretend we have concluded the study and observed evidence. Firstly, we are delighted to discover that there is a statistically significant effect of the intervention on outcomes. Let’s update our Bayesian network (note how the Alternative outcome on ATE has been underlined and emboldened):

P(Null) has now dropped to 6% and P(ACME > 0) has risen to 36%. We do not yet have sufficient evidence to distinguish between Out and Med: their probabilities are both 47%.²

Next, let’s run the mediation analysis. It is also statistically significant:

So, given our initial probability assignments and the pretend evidence observed, we can be 93% sure that the intervention works and does so through the mediator.

If the mediation test had not been statistically significant, then P(Out) would have risen to 69% and P(Med) would have dropped to 22%. If the ATE had been indistinguishable from zero, then P(Null) would have been 83%.
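For readers without GeNIe Modeler, the same numbers fall out of two direct applications of Bayes’ rule (a sketch; the probabilities are simply those assumed above):

```python
# Priors over the three mutually exclusive hypotheses.
priors = {"Null": 0.50, "Out": 0.25, "Med": 0.25}

# P(significant ATE | hypothesis): power 80%, Type I error 5%.
p_ate_sig = {"Null": 0.05, "Out": 0.80, "Med": 0.80}

# Update on observing a significant ATE.
joint = {h: priors[h] * p_ate_sig[h] for h in priors}
z = sum(joint.values())
post_ate = {h: joint[h] / z for h in joint}

# P(significant ACME | significant ATE, hypothesis).
p_acme_sig = {"Null": 0.05, "Out": 0.05, "Med": 0.70}

# Update again on observing a significant ACME.
joint2 = {h: post_ate[h] * p_acme_sig[h] for h in post_ate}
z2 = sum(joint2.values())
post_acme = {h: joint2[h] / z2 for h in joint2}

print({h: round(post_ate[h], 2) for h in post_ate})
# {'Null': 0.06, 'Out': 0.47, 'Med': 0.47}
print(round(z2, 2))                # 0.36, P(significant ACME | significant ATE)
print(round(post_acme["Med"], 2))  # 0.93
```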

Is this process tracing or simply putting Bayes’ rule to work as usual? Does this example show that RCTs can be theory-based evaluations, since process tracing is a theory-based method, or does the inclusion of a control group rule out that possibility, as Figure 3.1 of the Magenta Book would suggest? I will leave the reader to assign probabilities to each possible conclusion. Let me know what you think.

¹ Okay, I accept that it is controversial to say that process tracing is necessarily an application of Bayes, particularly when no sums are involved. However, to me Bayes’ rule explains in the simplest possible terms why the four tests attributed to Van Evera (1997) [Guide to Methods for Students of Political Science. Ithaca, NY: Cornell University Press.] work. It’s clear why there are so many references to Bayes in the process tracing literature.

² These are all actually conditional probabilities. I have made this implicit in the notation for ease of reading. Hopefully all is clear given the prose.

For example, P(Hyp = Med | ATE = Alternative) =  47%; in other words, the probability of Med given a statistically significant ATE estimate is 47%.

Variance estimation when matching with replacement

“Matching with replacement induces two types of correlations that must be accounted for when estimating the variance of estimated treatment effects. The first is a within-matched set correlation in outcomes. Matched subjects within the same matched set have similar values of the propensity score. Subjects who have the same value of the propensity score have measured baseline covariates that come from the same multivariate distribution. In the presence of confounding, baseline covariates are related to the outcome. Thus, matched subjects are more likely to have similar outcomes compared to two randomly selected subjects. The second source of correlation is induced by repeated use of control subjects. Failure to account for this correlation and acting as though the matched control subjects were independent observations will likely result in estimated standard errors that are artificially small and estimated confidence intervals that are artificially narrow. Added complexity is introduced by having subjects cross-classified with matched sets such that the same control subject can belong to more than one matched set.”

Austin, P. C., & Cafri, G. (2020). Variance estimation when using propensity-score matching with replacement with survival or time-to-event outcomes. Statistics in Medicine, 39(11), 1623–1640. (Quotation from p. 1625.)