The goal of this note is to introduce a way to describe the results of clinical trials using concepts that are easy to communicate to non-statisticians. In particular, I propose that clinical trials should be viewed as answering the question “how effective will this treatment be?” rather than “is this treatment effective?” That is, we should report the size of the treatment effect based on an estimate for how the treatment will benefit a new group of subjects. To take into account uncertainty, we should be conservative and estimate the smallest treatment effect that we would expect to observe over many repeated studies.

I introduce an estimate called the Smallest Effect over Repeated Studies (SERS) which is defined as the smallest treatment effect we would expect to observe if we repeated an experiment M times. For example, the following statements describe the results of the same hypothetical clinical trial:

NHST [1]: We estimated that the treatment increased overall survival by 3 months relative to the control; however, if the treatment were not more effective than the control, we would observe an equal or larger treatment effect with probability p ≤ 0.01.

SERS: We would expect the treatment to increase overall survival by at least 1 month relative to the control if we repeated the experiment M = 10 times [2].

To be concrete, I derive the SERS estimate for the difference in the mean outcome in the treatment group and the mean outcome in the control group.

Randomized clinical trials (RCTs) are used in medicine to determine if an experimental treatment (“the treatment”) is more effective than the current standard of care and/or placebo (“the control”) for a given disease. In an RCT, a well-defined group of subjects are randomly assigned to receive the treatment or the control. At the end of the trial, the subjects who received the treatment are compared to those who received the control to estimate the effect of the treatment. For example, the effect of a hypothetical treatment could be that “the treatment improved overall survival by 3 months relative to the control, on average.”

Most health outcomes are probabilistic and, as a result, it isn’t possible to determine a treatment effect exactly. Therefore, researchers use statistical tools to determine if an observed effect is reliable. One of the main tools that researchers use to assess the reliability of an observed effect is based on a statistical concept known as significance. To assess the statistical significance of an observed treatment effect, researchers estimate the probability of observing an equal or larger effect in similarly designed studies if, in reality, the treatment is no more effective than the control. This probability is called a p-value. If the p-value is less than an agreed-upon threshold (usually p = 0.05), then the researchers reject the null hypothesis that the treatment isn’t more effective than the control.
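As a minimal sketch of that calculation, assume a one-sided z-test on the difference in sample means with a normal approximation. The function name and the toy data below are hypothetical and only serve to illustrate the definition of a p-value.

```python
import math
from statistics import mean, variance

def one_sided_p_value(control, treatment):
    """P(equal or larger difference in means | no true difference), normal approx."""
    diff = mean(treatment) - mean(control)
    # Standard error of the difference in means (sample variances, unequal allowed).
    se = math.sqrt(variance(treatment) / len(treatment)
                   + variance(control) / len(control))
    z = diff / se
    # Upper tail of the standard normal distribution: 1 - Phi(z).
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Hypothetical outcomes for two small groups.
control = [10.1, 9.8, 11.2, 10.5, 9.9, 10.7, 10.0, 10.3]
treatment = [11.0, 10.9, 12.1, 11.4, 10.6, 11.8, 11.1, 11.5]

p = one_sided_p_value(control, treatment)
print(f"one-sided p-value: {p:.2e}")
```

With groups this small a t-test would usually be preferred; the z-test is used here only to keep the sketch short.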

Reliance on Null-Hypothesis Significance Testing (NHST) is a controversial topic that has generated quite a lot of recent discussion [3,4,5,6]. Proponents argue that using agreed-upon tests for statistical significance in well-designed RCTs can prevent treatments that don’t work from being sold to patients. Opponents say that relying on statistical significance alone can lead to poor decisions because it provides a black-and-white answer to a problem with shades of grey; e.g., is an observed effect with p = 0.049 really more reliable than one with p = 0.051?

My main complaints with null-hypothesis significance testing are twofold:

- It is confusing, and this leads to many misunderstandings by non-statisticians.
- It doesn’t really help us answer the main question “How big is the treatment effect?”

In fact, the second point is worth a closer look given the first.

There is essentially no such thing as a treatment effect that is exactly zero. If the treatment only consisted of an extra glass of water per day, one would expect it to have at least some effect, even if that effect is extremely small.

The theory underlying NHST is based on controlling the probabilities of two types of errors [7]. The type I error rate is the probability that we reject the null hypothesis given that it is true. The type II error rate is the probability that we fail to reject the null hypothesis given that the true treatment effect is non-zero.

But this is a straw man. The treatment effect is surely not exactly zero. What we are really worried about are treatments with effects that are so small they are practically zero. In practice, researchers screen out such treatments by running experiments for which the sample size is too small to reliably estimate such small effects (i.e., they are underpowered). When studies that are underpowered show significant effects, these effects will generally be overestimates [8].

These points combine to create the following situation. Researchers run a study with power that is too low to detect small effects and observe a treatment effect with p < 0.05. The treatment is declared effective and given to patients. Follow-up studies show that the treatment is not as effective as suggested by the original study.
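This situation can be illustrated with a small simulation. The sketch below assumes a two-arm study with a known outcome standard deviation, a small true effect, and a one-sided 5% significance threshold; all of the numbers are hypothetical. It draws many estimated effects from their sampling distribution and averages only the ones that reach significance.

```python
import random
from statistics import mean

def simulate_significant_effects(true_effect=0.1, sd=1.0, n_per_arm=20,
                                 n_studies=20000, seed=0):
    """Return (fraction of significant studies, mean estimate among them)."""
    rng = random.Random(seed)
    # Standard error of the difference in means with known sd and equal arms.
    se = sd * (2.0 / n_per_arm) ** 0.5
    z_crit = 1.645  # one-sided 5% threshold for a z-test
    significant = []
    for _ in range(n_studies):
        # Draw the estimated difference directly from its sampling distribution.
        estimate = rng.gauss(true_effect, se)
        if estimate / se > z_crit:
            significant.append(estimate)
    return len(significant) / n_studies, mean(significant)

power, mean_significant = simulate_significant_effects()
print(f"fraction significant (power): {power:.3f}")
print(f"mean estimate when significant: {mean_significant:.2f} (true effect 0.1)")
```

Under these assumptions every significant estimate must exceed roughly 0.52, so the studies that "succeed" overstate a true effect of 0.1 by a factor of five or more.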

I think we would be better off with an approach that produces reliable results that are easier for everybody to understand.

Medical outcomes are probabilistic. As a result, a researcher will obtain a different estimate for the treatment effect each time he/she repeats a study. To be conservative, it makes sense to focus on the smallest you would expect one of these estimated treatment effects to be. I define the Smallest Effect over Repeated Studies (SERS) estimator as the smallest treatment effect one would expect to observe if the experiment is repeated M times.

For example, suppose that we are interested in estimating the difference in the mean of some outcome between the treatment and control groups. In the appendix, I show that the SERS estimate for the difference of two means is

Δ_min ≅ μ_Δ - σ_Δ √(2 ln M)

Here, μ_Δ and σ_Δ are the expected value and standard deviation of the difference in the two means taken with respect to the posterior predictive distribution, and M is the number of times the experiment will be repeated. The SERS estimate is approximately

Δ_min ≅ μ_Δ - 2.15 σ_Δ for M = 10

Δ_min ≅ μ_Δ - 3 σ_Δ for M = 100

As this looks a lot like the lower bound of a confidence interval [9], we can get values of M that roughly correspond to typical decision rules: (Medicine) p ≤ 0.05 ↔ M = 7, (Physics) 5 sigma rule ↔ M = 250,000.
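The multipliers quoted above can be reproduced with a few lines of code. This sketch assumes the large-M multiplier is √(2 ln M), which matches the quoted values of 2.15, 3, M = 7 and M = 250,000; the function names and the example numbers (a 3-month estimate with a hypothetical σ_Δ of 0.93 months) are illustrative only.

```python
import math

def sers_multiplier(m):
    """Number of standard deviations subtracted from the mean, assuming sqrt(2 ln M)."""
    return math.sqrt(2.0 * math.log(m))

def sers_estimate(mu_delta, sigma_delta, m):
    """Large-M SERS estimate for the difference in means."""
    return mu_delta - sers_multiplier(m) * sigma_delta

for m in (7, 10, 100, 250_000):
    print(f"M = {m:>7}: multiplier = {sers_multiplier(m):.2f}")

# Hypothetical trial: estimated gain of 3 months with sigma_delta = 0.93 months.
print(f"SERS (M = 10): {sers_estimate(3.0, 0.93, 10):.2f} months")
```

The loop gives multipliers of about 1.97, 2.15, 3.03 and 4.99, close to the 1.96 and 5 sigma rules mentioned in the text.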

Is SERS a Bayesian estimate? As I have described it, yes. Bayesian statistics provides a distribution called the posterior predictive distribution, which assigns a probability to future observations given past observations. I suggest using this distribution to compute μ_Δ and σ_Δ. In general, I think that posterior predictive distributions are the coolest thing we get from Bayesian statistics. They allow us to predict new observations, they are falsifiable (e.g., if you use a terrible prior then you will make terrible predictions, and that’s something you can actually test), and the concept is kind of difficult to translate into the frequentist framework.

People tend to get hung up on prior beliefs when discussing Bayesian inference and forget to talk about the role of preferences. A posterior distribution doesn’t give you a single number that you can call “the estimate for the treatment effect;” you have to choose a summary statistic to use as the estimate. Researchers typically choose the posterior mean because it minimizes the mean squared error, but different utility functions lead to different estimates. For example, Baron [10] showed that the Bayesian estimate for the difference in means under a utility function that penalizes overestimating the treatment effect is

in which λ≥0 is a parameter that describes the additional cost of overestimating the treatment effect relative to underestimating the treatment effect.

A SERS estimate and the lower bound of the confidence interval have similar formulas, but different interpretations. For example, an α% confidence interval is a random interval computed so that, if you were to repeat the experiment many times and compute an α% confidence interval each time, then α% of these intervals would contain the true treatment effect. In practice, a 95% confidence interval for the difference in two means is Δ_± = μ_Δ ± 1.96 σ_Δ. The SERS estimate, by contrast, is a statement about the smallest estimated effect we would expect to observe over M repeated studies.

The concept of statistical power doesn’t make a lot of sense if you accept the argument that treatments may have extremely small, but never exactly zero, effects. Nevertheless, we can compute something analogous to power: the probability that the SERS estimate will be less than or equal to some threshold for clinical significance, P_M(Δ_min is too small). This is something a bit like the probability of making a type-S error [11]. Of course, to do this calculation one needs to have some prior estimates for the statistics of the control and treatment groups, but you need that for any power calculation. In the case of the difference between two means the SERS estimate has the same formula as the lower bound of a confidence interval, so the “power” calculation looks very similar to a classical power calculation.
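A minimal sketch of such a calculation follows. It assumes equal arm sizes, a known common outcome standard deviation, a normally distributed estimate centered on an assumed true effect, and the large-M multiplier √(2 ln M); the function names and all input values are hypothetical.

```python
import math

def normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def prob_sers_too_small(true_effect, sd, n_per_arm, m, threshold):
    """P(SERS estimate <= threshold) under the assumptions stated above."""
    sigma = sd * math.sqrt(2.0 / n_per_arm)  # std. dev. of the difference in means
    c = math.sqrt(2.0 * math.log(m))         # assumed large-M SERS multiplier
    # SERS = estimate - c * sigma <= threshold  <=>  estimate <= threshold + c * sigma
    return normal_cdf((threshold + c * sigma - true_effect) / sigma)

# Hypothetical planning values: true effect 1.0, sd 2.0, clinical threshold 0.2, M = 10.
for n in (50, 100, 200):
    p = prob_sers_too_small(1.0, 2.0, n, 10, 0.2)
    print(f"N = {n:>3} per arm: P(SERS <= threshold) = {p:.3f}")
```

As in a classical power calculation, the probability of a disappointing result falls as the sample size grows.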

Suppose that we collect samples {x_1, …, x_N} and {y_1, …, y_N} from the control and treatment groups, respectively. We are interested in estimating the difference in the means of these two groups. The test statistic is

Δ = (1/N) Σ_i y_i - (1/N) Σ_i x_i

For sufficiently large N, the test statistic will be approximately normally distributed with mean μ_Δ and standard deviation σ_Δ. Estimates for the mean and variance of the test statistic can be obtained by computing the moments of the test statistic with respect to the posterior predictive distribution

μ_Δ = E[Δ | data], σ_Δ² = Var[Δ | data]

or by using the bootstrap [12].
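The bootstrap route can be sketched as follows. This is a plain nonparametric bootstrap that resamples each arm with replacement; the function name, the number of resamples, and the toy data are hypothetical choices rather than part of the method described here.

```python
import random
from statistics import mean, stdev

def bootstrap_moments(control, treatment, n_boot=5000, seed=0):
    """Bootstrap estimates of the mean and std. dev. of the difference in means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each arm with replacement, keeping the original arm sizes.
        xs = rng.choices(control, k=len(control))
        ys = rng.choices(treatment, k=len(treatment))
        diffs.append(mean(ys) - mean(xs))
    return mean(diffs), stdev(diffs)

# Hypothetical outcomes for two small groups.
control = [10.1, 9.8, 11.2, 10.5, 9.9, 10.7, 10.0, 10.3]
treatment = [11.0, 10.9, 12.1, 11.4, 10.6, 11.8, 11.1, 11.5]

mu_delta, sigma_delta = bootstrap_moments(control, treatment)
print(f"mu_delta = {mu_delta:.3f}, sigma_delta = {sigma_delta:.3f}")
```

The two returned values can then be plugged into the SERS formula for a chosen M.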

If we were to repeat the experiment M times, the probability density that δ is the minimum estimate for the difference in means we would observe is [13]

p_min(δ) = M p(δ) [1 - P(δ)]^(M-1)

Here, p(δ) and P(δ) are the PDF and CDF, respectively, of a normal distribution with mean μ_Δ and standard deviation σ_Δ. We define the SERS for the difference in means as

Δ_min = argmax_δ p_min(δ)

To find the maximum, we set the derivative of the natural logarithm of p_min(δ) to zero,

d/dδ ln p_min(δ) = 0 at δ = Δ_min,

which yields a transcendental equation for Δ_min

(μ_Δ - Δ_min) / σ_Δ² = (M - 1) p(Δ_min) / [1 - P(Δ_min)]

This equation is difficult to solve in general, but the solution can be approximated for sufficiently large M as [14]

Δ_min ≅ μ_Δ - σ_Δ √(2 ln M)
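The stationarity condition can also be solved numerically. The sketch below assumes the density of the minimum takes the standard order-statistic form M p(δ) [1 - P(δ)]^(M-1) with a normal p(δ). Writing Δ_min = μ_Δ - z σ_Δ, the condition then reduces to z Φ(z) = (M - 1) φ(z), where φ and Φ are the standard normal PDF and CDF. The function names and the bisection bounds are illustrative choices.

```python
import math

def phi(z):
    """Standard normal PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def residual(z, m):
    """Left side minus right side of z * Phi(z) = (m - 1) * phi(z)."""
    return z * Phi(z) - (m - 1) * phi(z)

def mode_multiplier(m, lo=1e-9, hi=10.0, tol=1e-10):
    """Solve the stationarity condition for z > 0 by bisection."""
    # residual is increasing in z for z > 0, negative near 0 and positive at hi.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if residual(mid, m) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

for m in (10, 100, 250_000):
    z = mode_multiplier(m)
    print(f"M = {m:>7}: numerical z = {z:.3f}, sqrt(2 ln M) = {math.sqrt(2 * math.log(m)):.3f}")
```

In this sketch the numerical root is somewhat smaller than √(2 ln M) at each of these values of M, so under these assumptions the closed-form approximation errs on the conservative side.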

Footnotes

1: Null-Hypothesis Significance Testing (NHST)

2: The astute reader may notice that the phrase “we would expect” sounds quite Bayesian, and it is. But I don’t want to turn this into a Bayesian vs Frequentist piece because I don’t really like the Bayesian hypothesis test for P(treatment effect > 0 | data) either; my opinion is that conservative estimates for effect sizes are better than hypothesis tests regardless of how those tests are done.

3: https://www.nature.com/articles/d41586-019-00857-9

4: https://amstat.tandfonline.com/doi/full/10.1080/00031305.2016.1154108#.XRqH4pJKhrn

5: https://www.nature.com/articles/d41586-019-00969-2

6: https://www.americanscientist.org/article/the-statistical-crisis-in-science

7: https://statmodeling.stat.columbia.edu/2004/12/29/type_1_type_2_t/

8: https://www.sciencedirect.com/science/article/pii/S0749596X18300640

9: I’m being loose with terms here because I suggest that the mean and variance be computed from a Bayesian posterior predictive distribution, but if the priors are flat then this is the same as the formula for the lower bound of a confidence interval and I am comparing those formulas rather than their interpretations.

10: https://www.degruyter.com/view/j/strm.2000.18.issue-4/strm.2000.18.4.367/strm.2000.18.4.367.xml

11: http://www.stat.columbia.edu/~gelman/research/published/retropower20.pdf

12: See equation 3.31 in https://arxiv.org/pdf/1301.2936.pdf

13: https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.87.198103