A cynical view of SEMs

It is all too common for a box-and-arrow diagram to be cobbled together in an afternoon and christened a “theory of change”. One formalised version of such a diagram is a structural equation model (SEM), the arrows of which are annotated with coefficients estimated using data. Here is John Fox (2002) on SEM and informal boxology:

“A cynical view of SEMs is that their popularity in the social sciences reflects the legitimacy that the models appear to lend to causal interpretation of observational data, when in fact such interpretation is no less problematic than for other kinds of regression models applied to observational data. A more charitable interpretation is that SEMs are close to the kind of informal thinking about causal relationships that is common in social-science theorizing, and that, therefore, these models facilitate translating such theories into data analysis.”

References

Fox, J. (2002). Structural Equation Models: Appendix to An R and S-PLUS Companion to Applied Regression. Last corrected 2006.

Do “Growth Mindset” interventions improve students’ academic attainment?

“We conducted a systematic review and multiple meta-analyses of the growth mindset intervention literature. Our goal was to answer two questions: (a) Do growth mindset interventions generally improve students’ academic achievement? and (b) Are growth mindset intervention effects due to instilling growth mindsets in students or are apparent effects due to shortcomings in study designs, analyses, and reporting? To answer these questions, we systematically reviewed the literature and conducted multiple meta-analyses imposing varying degrees of quality control. Our results indicated that apparent effects of growth mindset interventions are possibly due to inadequate study designs, reporting flaws, and bias. In particular, the systematic review yielded several concerning patterns of threats to internal validity.”


Privacy implications of hashing data, by John Cook

Cryptographic hash functions are sometimes used to create pseudo IDs from identifiable information such as NHS numbers. An advantage of this approach is that two sites with no way to communicate with each other will generate the same pseudo ID for the same person, allowing data to be linked. A disadvantage is that, although cryptographic hashes are designed to be infeasible to invert in general, a precomputed lookup table (such as a rainbow table) can be used to invert the hash when the space of inputs is relatively small, as is the case for ID numbers. John Cook explores.
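
To make the small-input-space problem concrete, here is a minimal sketch. It assumes an unsalted SHA-256 hash and a made-up five-digit ID scheme (both are my assumptions for illustration, not details from Cook's post), and it uses a plain dictionary rather than a true rainbow table, which is a more memory-efficient version of the same idea.

```python
import hashlib

ID_DIGITS = 5  # hypothetical toy ID scheme: 100,000 possible identifiers


def pseudo_id(identifier: str) -> str:
    """Hash an identifier with unsalted SHA-256, as a site might do."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def build_lookup_table(digits: int = ID_DIGITS) -> dict:
    """Precompute hash -> identifier for every possible identifier."""
    table = {}
    for n in range(10 ** digits):
        identifier = f"{n:0{digits}d}"
        table[pseudo_id(identifier)] = identifier
    return table


# The attacker builds the table once, then inverts any pseudo ID by lookup.
table = build_lookup_table()
leaked_pseudo_id = pseudo_id("04217")
recovered = table[leaked_pseudo_id]
print(recovered)  # prints 04217
```

Real ID numbers have more digits than this toy scheme, but the input space is still tiny by cryptographic standards, so the same enumeration remains feasible.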

Baseline balance in experiments and quasi-experiments

Baseline balance is important for both experiments and quasi-experiments, just not in the way researchers sometimes believe. Here are excerpts from three of my favourite discussions of the topic.

Don’t test for baseline imbalance in RCTs. Senn (1994, p. 1716):

“… the following are two incontrovertible facts about a randomized clinical trial:

1. over all randomizations the groups are balanced;

2. for a particular randomization they are unbalanced.

Now, no ‘[statistically] significant imbalance’ can cause 1 to be untrue and no lack of a significant balance can make 2 untrue. Therefore the only reason to employ such a test must be to examine the process of randomization itself. Thus a significant result should lead to the decision that the treatment groups have not been randomized…”
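
Senn's two facts can be seen in a small simulation. This is only a sketch: the sample size, the standard normal baseline covariate, and the number of re-randomizations are all made up for illustration.

```python
import random
import statistics

rng = random.Random(1)

N = 100  # hypothetical trial size; half are assigned to each arm
baseline = [rng.gauss(0, 1) for _ in range(N)]  # made-up baseline covariate


def difference_in_means(covariate, rng):
    """Randomly assign half the sample to treatment and return
    mean(treated) - mean(control) on the baseline covariate."""
    indices = list(range(len(covariate)))
    rng.shuffle(indices)
    half = len(indices) // 2
    treated = [covariate[i] for i in indices[:half]]
    control = [covariate[i] for i in indices[half:]]
    return statistics.fmean(treated) - statistics.fmean(control)


# Re-randomize the same participants many times.
differences = [difference_in_means(baseline, rng) for _ in range(2000)]

# Fact 2: any particular randomization is unbalanced to some degree.
print("first randomization:", round(differences[0], 3))
print("largest imbalance:", round(max(abs(d) for d in differences), 3))
# Fact 1: over all randomizations the imbalances average out.
print("mean over randomizations:", round(statistics.fmean(differences), 3))
```

Individual randomizations show non-zero differences, some of them sizeable, while the average over randomizations sits close to zero.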

Do examine baseline imbalance in quasi-experiments; however, not by using statistical tests. Sample descriptives, such as a difference in means, suffice. Imai et al. (2008, p. 497):

“… from a theoretical perspective, balance is a characteristic of the sample, not some hypothetical population, and so, strictly speaking, hypothesis tests are irrelevant…”

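Using p-values from t-tests and similar can lead to erroneous decisions of balance. As you prune a dataset to improve balance, the sample shrinks and the test’s power to detect any remaining imbalance falls, so a non-significant result need not mean that balance has improved. Imai et al. (2008, p. 497 again):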

“Since the values of […] hypothesis tests are affected by factors other than balance, they cannot even be counted on to be monotone functions of balance. The t-test can indicate that balance is becoming better whereas the actual balance is growing worse, staying the same or improving. Although we choose the most commonly used t-test for illustration, the same problem applies to many other test statistics…”
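
A minimal sketch of the problem, using made-up covariate values. The "pruning" here simply shrinks the sample while keeping its shape, which is my simplification rather than a real matching procedure. The standardized mean difference is a sample descriptive; Welch's t statistic also depends on sample size.

```python
import math
import statistics


def standardized_mean_difference(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(
        (statistics.variance(treated) + statistics.variance(control)) / 2
    )
    return (statistics.fmean(treated) - statistics.fmean(control)) / pooled_sd


def welch_t(treated, control):
    """Welch's t statistic for a difference in means."""
    standard_error = math.sqrt(
        statistics.variance(treated) / len(treated)
        + statistics.variance(control) / len(control)
    )
    return (statistics.fmean(treated) - statistics.fmean(control)) / standard_error


# Made-up data: treated values are the control values shifted up by 0.5.
control_pattern = [1.0, 2.0, 3.0, 4.0, 5.0]
treated_pattern = [x + 0.5 for x in control_pattern]

for copies in (200, 20, 2):  # shrink the sample, keep the imbalance
    treated = treated_pattern * copies
    control = control_pattern * copies
    print(
        len(treated),
        round(standardized_mean_difference(treated, control), 2),
        round(welch_t(treated, control), 2),
    )
```

The standardized mean difference stays at roughly 0.35 throughout, while the t statistic falls from about 7.9 with 1,000 per group to about 0.75 with 10 per group. A test would report the smallest sample as "balanced" even though the imbalance is essentially unchanged.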

If your matching has led to baseline balance, then you’re good, even if the matching model is misspecified. (Though not if you’re missing key covariates, of course.) Rosenbaum (2023, p. 29):

“So far as matching and stratification are concerned, the propensity score and other methods are a means to an end, not an end in themselves. If matching for a misspecified and misestimated propensity score balances x, then that is fine. If by bad luck, the true propensity score failed to balance x, then the match is inadequate and should be improved.”

References

Imai, K., King, G., & Stuart, E. A. (2008). Misunderstandings between experimentalists and observationalists about causal inference. Journal of the Royal Statistical Society: Series A (Statistics in Society), 171(2), 481–502.

Rosenbaum, P. R. (2023). Propensity score. In J. R. Zubizarreta, E. A. Stuart, D. S. Small, & P. R. Rosenbaum, Handbook of Matching and Weighting Adjustments for Causal Inference (pp. 21–38). Chapman and Hall/CRC.

Senn, S. (1994). Testing for baseline balance in clinical trials. Statistics in Medicine, 13, 1715–1726.

Communication is probably more than 7% verbal

“Have you ever heard the adage that communication is only 7 percent verbal and 93 percent non-verbal, i.e. body language and vocal variety? You probably have, and if you have any sense at all, you have ignored it.” Philip Yaffe wades into this one. The first two pages provide a concise summary of the 1967 studies that produced this 7% figure:

Subjects were asked to listen to a recording of a woman’s voice saying the word “maybe” three different ways to convey liking, neutrality, and disliking. They were also shown photos of the woman’s face conveying the same three emotions. They were then asked to guess the emotions heard in the recorded voice, seen in the photos, and both together. The result? The subjects correctly identified the emotions 50 percent more often from the photos than from the voice.

In the second study, subjects were asked to listen to nine recorded words, three meant to convey liking (honey, dear, thanks), three to convey neutrality (maybe, really, oh), and three to convey disliking (don’t, brute, terrible). Each word was pronounced three different ways. When asked to guess the emotions being conveyed, it turned out that the subjects were more influenced by the tone of voice than by the words themselves.

The original studies behind the figure look interesting for what they actually tried to do rather than the bullshit claims that resulted.

Originals

Mehrabian, A., & Wiener, M. (1967). Decoding of inconsistent communications. Journal of Personality and Social Psychology, 6(1), 109–114.

Mehrabian, A., & Ferris, S. R. (1967). Inference of attitudes from nonverbal communication in two channels. Journal of Consulting Psychology, 31(3), 248–252.

Can you bullshit a bullshitter?

You can bullshit a bullshitter, except if they also have high cognitive ability, according to Littrell et al. (2021).

Littrell, S., Risko, E. F., & Fugelsang, J. A. (2021). ‘You can’t bullshit a bullshitter’ (or can you?): Bullshitting frequency predicts receptivity to various types of misleading information. British Journal of Social Psychology, 60(4), 1484–1505.

History repeating in psychedelics research

Interesting paper by Michiel van Elk and Eiko Fried (2023) on flawed evaluations of psychedelics to treat mental health conditions and how to do better. Neat 1966 quotation at the end:

“Given the current state of research, strong caution is warranted regarding the hype around psychedelics as treatments: there is not enough robust evidence to draw any firm conclusions about the safety and efficacy of psychedelic therapy. […] For psychedelic research in particular, we are not the first to raise concerns and can only echo the warning expressed more than half a century ago [Masters & Houston, 1966]:

“‘To be hopeful and optimistic about psychedelic drugs and their potential is one thing; to be messianic is another. Both the present and the future of psychedelic research already have been grievously injured by a messianism that is as unwarranted as it has proved undesirable.’”

References

Van Elk, M., & Fried, E. I. (2023). History repeating: Guidelines to address common problems in psychedelic science. Therapeutic Advances in Psychopharmacology, 13.

Five questions to ask of social research

  1. Why should I care about this sample? Is the sample itself of interest, whether 1 person (e.g., a biography-like case study) or 1,000?
  2. If generalisation to a broader population is intended or implied, how is the case made that the findings do generalise?
  3. To what extent do findings depend on participants being able to explain why they acted the way they did? People sometimes tell more than they can know.
  4. If the researchers state that X caused or contributed to Y, what evidence is provided that if X hadn’t been the case, then Y would have been different?
  5. What political agendas do the researchers and their institutions have, e.g., as influenced by who funds them?

Dealing with confounding in observational studies

Excellent review of simulation-based evaluations of quasi-experimental methods, by Varga et al. (2022). Also lovely annexes summarising the methods’ assumptions.

Methods for measured confounding the authors cover (Varga et al., 2022, Table A1):

| Method | Description of the method |
| --- | --- |
| PS matching (N = 47) | Treated and untreated individuals are matched based on their propensity score-similarity. After creating comparable groups of treated and untreated individuals the effect of the treatment can be estimated. |
| IPTW (N = 30) | With the help of re-weighting by the inverse probability of receiving the treatment, a synthetic sample is created which is representative of the population and in which treatment assignment is independent of the observed baseline covariates. Over-represented groups are downweighted and underrepresented groups are upweighted. |
| Overlap weights (N = 4) | Overlap weights were developed to overcome the limitations of truncation and trimming for IPTW, when some individual PSs approach 0 or 1. |
| Matching weights (N = 2) | Matching weights is an analogue weighting method for IPTW, when some individual PSs approach 0 or 1. |
| Covariate adjustment using PS (N = 13) | The estimated PS is included as covariate in a regression model of the treatment. |
| PS stratification (N = 26) | First the subjects are grouped into strata based upon their PS. Then, the treatment effect is estimated within each PS stratum, and the ATE is computed as a weighted mean of the stratum specific estimates. |
| GAM (N = 1) | GAMs provide an alternative for traditional PS estimation by replacing the linear component of a logistic regression with a flexible additive function. |
| GBM (N = 3) | GBM trees provide an alternative for traditional PS estimation by estimating the function of covariates in a more flexible manner than logistic regression by averaging the PSs of small regression trees. |
| Genetic matching (N = 7) | This matching method algorithmically optimizes covariate balance and avoids the process of iteratively modifying the PS model. |
| Covariate-balancing PS (N = 5) | Models treatment assignment while optimizing the covariate balance. The method exploits the dual characteristics of the PS as a covariate balancing score and the conditional probability of treatment assignment. |
| DR estimation (N = 13) | Combines outcome regression with a model for the treatment (e.g., weighting by the PS) such that the effect estimator is robust to misspecification of one (but not both) of these models. |
| AIPTW (N = 8) | This estimator achieves the doubly-robust property by combining outcome regression with weighting by the PS. |
| Stratified DR estimator (N = 1) | Hybrid DR method of outcome regression with PS weighting and stratification. |
| TMLE (N = 2) | Semi-parametric double-robust method that allows for flexible estimation using (nonparametric) machine-learning methods. |
| Collaborative TMLE (N = 1) | Data-adaptive estimation method for TMLE. |
| One step joint Bayesian PS (N = 3) | Jointly estimates quantities in the PS and outcome stages. |
| Two-step Bayesian approach (N = 2) | A two-step modeling method that uses the Bayesian PS model in the first step, followed by a Bayesian outcome model in the second step. |
| Bayesian model averaging (N = 1) | Fully Bayesian model averaging approach. |
| An’s intermediate approach (N = 2) | Not fully Bayesian insofar as the outcome equation in An’s approach is frequentist. |
| G-computation (N = 4) | The method interprets counterfactual outcomes as missing data and uses a prediction model to obtain potential outcomes under different treatment scenarios. The entire set of predicted outcomes is then regressed on the treatment to obtain the coefficient of the effect estimate. |
| Prognostic scores (N = 7) | Prognostic scores are considered to be the prognostic analog of the PS methods. The prognostic score includes covariates based on their predictive power of the response; the PS includes covariates that predict treatment assignment. |
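
To give a feel for one of the weighting methods listed, here is a minimal sketch of IPTW on made-up data with a single binary measured confounder. The data, the variable names, and the choice to estimate the propensity score as the share treated within each confounder level are all my assumptions for illustration, not anything taken from Varga et al.

```python
from collections import defaultdict

# Made-up rows of (confounder x, treatment t, outcome y) with y = 1*t + 2*x,
# so the true treatment effect is 1 for everyone.
rows = (
    [(0, 1, 1.0)] * 20 + [(0, 0, 0.0)] * 80    # x = 0: 20% treated
    + [(1, 1, 3.0)] * 80 + [(1, 0, 2.0)] * 20  # x = 1: 80% treated
)


def mean(values):
    return sum(values) / len(values)


# Naive comparison, which mixes the treatment effect with the effect of x.
naive = mean([y for _, t, y in rows if t == 1]) - mean(
    [y for _, t, y in rows if t == 0]
)

# Propensity score: estimated here as the share treated at each level of x.
counts = defaultdict(lambda: [0, 0])  # x -> [number treated, number in total]
for x, t, _ in rows:
    counts[x][0] += t
    counts[x][1] += 1
propensity = {x: treated / total for x, (treated, total) in counts.items()}


def weighted_mean(arm):
    """Mean outcome in one arm, weighting each person by the inverse of
    the probability of receiving the arm they actually received."""
    numerator = denominator = 0.0
    for x, t, y in rows:
        if t != arm:
            continue
        p = propensity[x] if arm == 1 else 1 - propensity[x]
        numerator += y / p
        denominator += 1 / p
    return numerator / denominator


iptw = weighted_mean(1) - weighted_mean(0)
print(round(naive, 2), round(iptw, 2))  # prints 2.2 1.0
```

The naive difference of 2.2 is inflated because treated people mostly have x = 1. After weighting, both arms have the same distribution of x and the estimate recovers the effect of 1 that was built into the data. This only works because x is measured; an unmeasured confounder would leave the weighted estimate biased.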

Methods for unmeasured confounding (Varga et al., 2022, Table A2):

| Method | Description of the method |
| --- | --- |
| IV approach (N = 17) | Post-randomization can be achieved using a sufficiently strong instrument. IV is correlated with the treatment and only affects the outcome through the treatment. |
| 2SLS (N = 11) | Linear estimator of the IV method. Uses linear probability for binary outcome and linear regression for continuous outcome. |
| 2SPS (N = 5) | Non-parametric estimator of the IV method. Logistic regression is used for both the first and second stages of 2SPS procedure. The predicted or residual values from the first stage logistic regression of treatment on the IV are used as covariates in the second stage logistic regression: the predicted value of treatment replaces the observed treatment for 2SPS. |
| 2SRI (N = 8) | Semi-parametric estimator of the IV method. Logistic regression is used for both the first and second stages of the 2SRI procedure. The predicted or residual values from the first stage logistic regression of treatment on the IV are used as covariates in the second stage logistic regression. |
| IV based on generalized structural mean model (GSMM) (N = 1) | Semi-parametric models that use instrumental variables to identify causal parameters. |
| Instrumental PS (Matching enhanced IV) (N = 2) | Reduces the dimensionality of the measured confounders, but it also deals with unmeasured confounders by the use of an IV. |
| DiD (N = 7) | DiD method uses the assumption that without the treatment the average outcomes for the treated and control groups would have followed parallel trends over time. The design measures the effect of a treatment as the relative change in the outcomes between individuals in the treatment and control groups over time. |
| Matching combined with DiD (N = 6) | Alternative approach to DiD. Uses matching to balance the treatment and control groups according to pre-treatment outcomes and covariates. |
| SCM (N = 7) | This method constructs a comparator, the synthetic control, as a weighted average of the available control individuals. The weights are chosen to ensure that, prior to the treatment, levels of covariates and outcomes are similar over time to those of the treated unit. |
| Imperfect SCM (N = 1) | Extension of SCM method with relaxed assumptions that allow outcomes to be functions of transitory shocks. |
| Generalized SCM (N = 2) | Combines SC with fixed effects. |
| Synthetic DiD (N = 1) | Both unit and time fixed effects, which can be interpreted as the time-weighted version of DiD. |
| LDV regression approach (N = 1) | Adjusts for pre-treatment outcomes and covariates with a parametric regression model. Alternative approach to DiD. |
| Trend-in-trend (N = 1) | The trend-in-trend design examines time trends in outcome as a function of time trends in treatment across strata with different time trends in treatment. |
| PERR (N = 3) | PERR adjustment is a type of self-controlled design in which the treatment effect is estimated by the ratio of two rate ratios (RRs): RR after initiation of treatment and the RR prior to initiation of treatment. |
| PS calibration (N = 1) | Combines PS and regression calibration to address confounding by variables unobserved in the main study by using variables observed in a validation study. |
| RD (N = 4) | Method used for policy analysis. People slightly below and above the threshold for being exposed to a treatment are compared. |
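
As one worked example from this table, the basic two-group, two-period DiD estimate is just arithmetic on four group means. The sketch below uses made-up outcome values and a hypothetical helper function, and it only gives a causal estimate if the parallel trends assumption holds.

```python
import statistics


def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group's mean outcome minus the change in the
    control group's mean outcome over the same period."""
    change_treated = statistics.fmean(treated_post) - statistics.fmean(treated_pre)
    change_control = statistics.fmean(control_post) - statistics.fmean(control_pre)
    return change_treated - change_control


# Made-up outcomes: the groups start at different levels, which DiD allows.
treated_pre = [9.0, 10.0, 11.0]    # mean 10
treated_post = [14.0, 15.0, 16.0]  # mean 15
control_pre = [7.0, 8.0, 9.0]      # mean 8
control_post = [10.0, 11.0, 12.0]  # mean 11

effect = difference_in_differences(
    treated_pre, treated_post, control_pre, control_post
)
print(effect)  # prints 2.0
```

The treated group improved by 5 and the control group by 3, so the estimate is 2. The control group's change stands in for what would have happened to the treated group without treatment.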

References

Varga, A. N., Guevara Morel, A. E., Lokkerbol, J., van Dongen, J. M., van Tulder, M. W., & Bosmans, J. E. (2022). Dealing with confounding in observational studies: A scoping review of methods evaluated in simulation studies with single‐point exposure. Statistics in Medicine.