How I read studies
A skeptical, LLM-amenable checklist for saying "thank goodness I don't need to read this or update my beliefs in any way"
A friend recently sent this fun tweet to a groupchat I’m on:
To which I replied, “[Friend], I’m afraid we’re going to need to have a conversation about Plausible Effect Sizes…[in which I] adduce 15 reasons why there is no plausible way that ginger increases testosterone by 20%. I would have accepted 1% or 2%”. (This is exactly how I’m going to talk to MAHA, by the way.) I then actually read the posted pictures rather than just the highlighted parts, and noticed, from a 2021 systematic review: “there was no statistically significant difference in the total number and the motility of [sperm] between treated and control groups (Hosseini et al., 2016).” So there’s your first gotcha: the tweeter is obviously cherry-picking results. Seeing this didn’t take any special expertise. I just had to read the text.
There are more issues with the research,[1] which I merrily bombarded the chat with. But reflecting on this got me thinking about explaining my process, because the way I read ginger-and-testosterone papers is very similar to how I read MAP reduction papers. It’s essentially a checklist, or a game of 20 questions, where my goal is to avoid reading the whole paper, because reading a single paper carefully typically takes many hours. If my answer to many, or sometimes any, of these questions is “no,” then unless there’s some special reason to read it anyway, I put it down having proudly learned nothing. So can you.
My aim here is to equip you for the next time someone makes a big, empirical, cause-and-effect claim — say, that increased healthcare spending tends to lead to better health outcomes, that female-named hurricanes are deadlier because people do not take them seriously, or that Tylenol during pregnancy causes autism — that you think is probably bullshit. I want you to feel empowered to skim the relevant study, ask a few questions, and begin a semi-scripted response that sounds something like, “alas, this study is not credible/does not show what you say it does because…”
What would make me inclined to believe a study before reading it?
Any randomized controlled trial (RCT) published in a major economics or political science journal — American Economic Review, Quarterly Journal of Economics, Journal of Political Economy, The Review of Economic Studies, American Political Science Review, American Journal of Political Science, and Journal of Politics — has probably already passed much more stringent checks than those I’ve come up with, and I will basically take its results at face value (though of course no single paper is decisive). I also have beliefs about disciplines and journals whose results I will dismiss out of hand, but it is impolite to specify.
Questions to ask about studies
I wrote this post with LLMs in mind, so you can upload a PDF and this post and ask Claude to run through it; a scripted version is sketched below. Also note that questions 2-5 reflect the PICO framework (Population, Intervention, Comparator, Outcome) for medical research.
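If you’d rather script that than use the chat interface, here is a minimal sketch using the Anthropic Python SDK. The file names, prompt wording, and model string are placeholder assumptions; swap in whatever you actually have.

```python
# A minimal sketch, assuming the Anthropic Python SDK ("pip install anthropic")
# and an ANTHROPIC_API_KEY in your environment. File names and the model string
# below are placeholders, not prescriptions.
import base64

import anthropic

client = anthropic.Anthropic()

# Encode the study PDF for the API's document block.
with open("study.pdf", "rb") as f:
    pdf_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

# This post, saved as plain text, serves as the checklist.
with open("checklist.txt") as f:
    checklist = f.read()

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # swap in whatever model is current
    max_tokens=2000,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_b64,
                },
            },
            {
                "type": "text",
                "text": (
                    "Run through each question in this checklist against the "
                    "attached study, answering yes/no with a one-line reason:\n\n"
                    + checklist
                ),
            },
        ],
    }],
)
print(message.content[0].text)
```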
1. Is it an RCT? If not, do you have strong reason to believe that all else but the intervention is approximately equal between the treatment and control groups?
If a study is non-randomized, it might be helpful to think of reasons to doubt treatment-control equivalence. I do a bit of this in my dissection of plant-based defaults.
2. Is the sample drawn from the population we want to know about? The clearest version of this is “in mice,” i.e. noting that a study didn’t have any human subjects. But you also might find a study on the effects of “super shoes” making very general claims despite only looking at elite male athletes who can run a 10K faster than the average adult woman can run a 5K. It’s reasonable to wonder whether effects hold for the world at large. (A lack of female participants is a persistent problem in exercise research.)
3. Is the intervention the thing we actually want to test? Suppose someone cites a study purporting to show that ice baths reduce running injuries. Alas, the study actually tests immersing your wrists in cold water and then assesses physiological stress through “average rectal temperature” (yikes!). The intervention is not the thing any reasonable person would actually do to reduce running injuries.
4. Does the baseline (comparator) make sense? Suppose we want to assess the effects of seasonal malaria chemoprevention (SMC) on malaria cases. We then look at the Cochrane review on the subject (Meremikwu et al., 2012), which aggregates the results of seven studies. We see that four of those studies, to quote something I wrote in 2022, “test SMC in settings where both treatment and control samples are already receiving anti-malaria interventions. Two studies test SMC along with ‘home-based management of malaria (HMM)’ while two others test SMC ‘alongside ITN [insecticide treated nets] distribution and promotion’ (p. 9).” So now we have just three studies that really test SMC vs. no treatment; the other four test SMC + something else vs. something else. Is that the same estimand as the effect of SMC on its own? I have no idea.
5. Is it the outcome we care about? This one is usually easiest to scan for. I would stop reading the ice bath study when I got to the part in the abstract about average rectal temperatures, for example. The “super shoes” study measures running economy using a treadmill and a huge breathing apparatus but has nothing on running performance.
6. Are the samples large enough to say anything meaningful? This is a judgment call, but in general, if a study has fewer than ten people in either treatment or control, I’m not interested. My personal cutoff is 25 subjects in both treatment and control, based on a finding in Paluck et al. (2021) that studies with fewer people than that tended to have the largest effect sizes, which suggests some selection bias in low-powered studies. (For what 25 per arm can actually detect, see the power-calculation sketch after this list.)
7. Is the effect size plausible? A large, surprising effect size isn’t necessarily a red flag, but the maxim “extraordinary claims require extraordinary evidence” holds. A 20% average increase in testosterone implies, for a roughly symmetric distribution of responses, that about half of observed gains were even larger than that. Doesn’t that sound like a big increase from just eating a stem? Do we observe the kinds of effects in the world that we’d expect if that relationship held, e.g., much higher testosterone levels in cultures that eat a lot of ginger?[2]
8. Is the relevant comparison being made between groups? Something to remember: the difference between significant and not significant is not itself significant. This comes up with papers that have baseline and post-test measures for both treatment and control, and then tell you that the change over time for people who received treatment was significant, but the change for people who didn’t was not. (Example language: “In the first intervention group, the rating increased significantly (p = 0.001) after having watched the video. There was no such effect in the second intervention or the control group.”) I take this as evidence that the authors didn’t find a meaningful difference between the two groups. If they had, they would say so. (The simulation sketch after this list shows how this pattern arises even when the groups don’t differ.)
9. Is it published in a real journal? Not to keep picking on this poor ginger study, but I am afraid I’ve never heard of the medical journal of Tikrit University. There are a lot of nonsense journals and papers out there. I have more context for evaluating something published on a preprint server than I do this.
10. Is there evidence of open science practices: pre-registration, open data, reproducible code? I expect modern experiments to have all three or a good story about why they don’t.
11. Do you want the results to be true? If a paper gives you hope or joy (defaults are super effective!) or confirms your worldview (conservatives are more susceptible to false information!), it’s time to double down on your skepticism. Sorry 😃
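On question 6, a quick way to build intuition for what 25 subjects per arm can and cannot detect is a standard power calculation. A minimal sketch, assuming statsmodels is installed; the effect sizes are the conventional small/medium/large benchmarks, not numbers from any particular study:

```python
# A minimal sketch: statistical power of a two-sample t-test at n = 25 per arm,
# using statsmodels (pip install statsmodels). Effect sizes are in standardized
# (Cohen's d) units; the values below are illustrative benchmarks.
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()

for d in (0.2, 0.5, 0.8):  # conventional small / medium / large effects
    power = power_calc.solve_power(effect_size=d, nobs1=25, alpha=0.05)
    print(f"d = {d}: power = {power:.2f}")

# The flip side: the smallest effect detectable with 80% power at n = 25.
detectable = power_calc.solve_power(nobs1=25, alpha=0.05, power=0.8)
print(f"80% power at n = 25 requires d of about {detectable:.2f}")
```

At 25 per arm you have roughly 80% power only for effects around d = 0.8, conventionally “large,” which fits the Paluck et al. observation: tiny studies that clear the significance bar tend to be the ones reporting huge effects.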
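And on question 8, a hypothetical simulation shows why “significant change in treatment, no significant change in control” is not evidence of a between-group difference. Both simulated groups improve by exactly the same amount on average, yet the misleading within-group pattern shows up in about a fifth of runs in this configuration, while the direct between-group test stays at its nominal 5% false-positive rate:

```python
# A minimal sketch: two groups with the SAME true improvement can easily yield
# "treatment changed significantly, control didn't" just by sampling noise.
# Pure simulation; all numbers are made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, true_gain, trials = 25, 0.3, 10_000
pattern, between_sig = 0, 0

for _ in range(trials):
    # Both groups improve by the same amount on average.
    change_t = rng.normal(true_gain, 1.0, n)  # treatment pre-to-post changes
    change_c = rng.normal(true_gain, 1.0, n)  # control pre-to-post changes

    p_t = stats.ttest_1samp(change_t, 0).pvalue  # within-group: treatment
    p_c = stats.ttest_1samp(change_c, 0).pvalue  # within-group: control
    p_between = stats.ttest_ind(change_t, change_c).pvalue  # the test that matters

    pattern += (p_t < 0.05) and (p_c >= 0.05)
    between_sig += p_between < 0.05

print(f"'treatment sig., control not' pattern: {pattern / trials:.1%}")
print(f"between-group difference significant: {between_sig / trials:.1%}")
```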
[1] It’s a fun and strange literature. First, I notice that a lot of the work seems to be on rats, and a 2018 piece notes that “the effect of ginger on testosterone is not yet confirmed in humans.” Are rats a good model organism for human infertility? Who knows? Second, for some reason, every piece I read was done by researchers at universities in the Middle East. Is ginger a widely touted folk remedy for infertility there? Is it to the west’s discredit that we’re not conducting large-scale RCTs on this popular and eminently testable theory? Will there eventually be a MAHA university where they test this stuff rigorously? These papers are a crab emerging from a dark pool. What else swims in the loomy gloom?
[2] Daniel Lakens phrases this well in a comment about a study on hungry judges: “If hunger had an effect on our mental resources of this magnitude, our society would fall into minor chaos every day at 11:45. Or at the very least, our society would have organized itself around this incredibly strong effect of mental depletion…we would stop teaching in the time before lunch, doctors would not schedule surgery, and driving before lunch would be illegal. If a psychological effect is this big, we don’t need to discover it and publish it in a scientific journal - you would already know it exists.” I am reminded of something funny Nora Ephron wrote in 2003 about her (not) having an affair with JFK: “I assure you if anything had gone on between the two of us, you would not have had to wait this long to find it out.”





