This is the third post in our series on finding exact matches between treated and control subjects in clinical trials. In this post, we describe and evaluate propensity score matching, another commonly used method for subject matching. **While propensity score matching makes it easier to match subjects with complex covariates, it generates a less precise estimate of how effective a drug is.**

In previous posts we discussed historical controls for clinical trials. Recall that a clinical trial with a historical control is one in which the control data are borrowed from a trial that was run previously. In that setting we have to be careful about how we compare the control data with the treatment data because the two subject populations need to be as similar as possible. Any imbalance can lead to error and uncertainty in estimating whether the new treatment is effective.

In the ideal scenario, we would pair each treated subject to a historical subject whose baseline covariates are identical to those of the treated subject. This approach is called direct matching. Unfortunately, we showed that such direct matching is usually infeasible in practice. This is especially true for trials dealing with complex diseases like Alzheimer’s Disease or Multiple Sclerosis because studies for these diseases measure a large number of covariates that potentially influence the outcome. As a result, direct matching must be performed with so many covariates that there will simply not be enough matches to build a historical control group of an appropriate size.

One of the most popular approaches to surmounting this problem is called propensity score matching (PSM). Rather than finding matches based on a large number of covariates (as is the case for direct matching), PSM attempts to find matches between the treated subjects and historical subjects on the basis of a single variable, the propensity score (PS). As a result, PSM provides a feasible means of estimating the treatment effect—but at a potentially steep cost. As we will see throughout this post, this estimate can be noisier than it need be, reducing the precision of the clinical trial.

The *propensity score* of a subject is the probability that this subject came from the treatment group versus the control group, given the subject’s baseline covariates.
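
In symbols (the notation is ours, not the original post's), writing $T$ for the treatment indicator ($T = 1$ for treated, $T = 0$ for control) and $X$ for the vector of baseline covariates, the definition reads:

$$
e(x) = \Pr(T = 1 \mid X = x)
$$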

Since PSM uses only one variable to match, it is more tractable than direct matching. PSM also ensures a certain kind of balance between the treatment and control groups.

Before we go into PSM in more detail, let’s talk about the concept of balance and why it’s so important. As mentioned earlier, the primary point of matching control subjects to treated subjects is to reduce imbalance, or dissimilarity, between the experimental group and the control group. If there are differences between treatment and control populations that impact the endpoint(s) we are measuring in a clinical trial, then those differences can introduce bias in estimating how effective the treatment is—also known as the average treatment effect (figure 1A).

To illustrate this point, let’s consider the following simple example.

Suppose we are testing a drug that is known to be more effective in one population than another. Let’s say it is more effective in population 2 than in population 1 (figure 1B). We are not completely sure why, but we are confident that the covariates we are measuring in the trial capture the differences between the populations that largely account for this. Now suppose that the treatment group in a clinical trial consists of a 50%-50% mix of the two populations, and that we want to draw a control group from a pool of historical control subjects in which populations 1 and 2 are mixed 60%-40%. If we simply draw the control group at random from the historical pool, then the proportions of population 1 vs. 2 will differ between the treatment and control groups. Since the drug is more effective for population 2, if we compute the difference in mean endpoint values between the treatment group and the control group, we will find that this difference is larger than it should be, overestimating the average treatment effect, because there are comparatively more people from population 2 in the treatment group (figure 1C).
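
To put numbers on this example, here is a small sketch of the arithmetic. All of the endpoint values are hypothetical and chosen only for illustration. In particular, we assume that population 2 also has a higher endpoint value without treatment, since it is that untreated difference which makes the mix of the control group matter.

```python
# Hypothetical per-population values (not taken from any trial).
untreated_mean = {"pop1": 10.0, "pop2": 12.0}  # mean endpoint without the drug
drug_effect = {"pop1": 1.0, "pop2": 3.0}       # the drug works better in population 2

treated_mix = {"pop1": 0.5, "pop2": 0.5}       # 50%-50% treatment group
historical_mix = {"pop1": 0.6, "pop2": 0.4}    # 60%-40% historical controls


def mean_endpoint(mix, treated):
    """Mean endpoint of a group with the given population mix."""
    return sum(
        share * (untreated_mean[pop] + (drug_effect[pop] if treated else 0.0))
        for pop, share in mix.items()
    )


# Average treatment effect in the treated group.
true_effect = sum(treated_mix[pop] * drug_effect[pop] for pop in treated_mix)

# Naive comparison against an unmatched historical control group.
naive_estimate = mean_endpoint(treated_mix, True) - mean_endpoint(historical_mix, False)

# Comparison against a control group with the same mix as the treated group.
matched_estimate = mean_endpoint(treated_mix, True) - mean_endpoint(treated_mix, False)

print(f"true effect:      {true_effect:.2f}")
print(f"naive estimate:   {naive_estimate:.2f}")
print(f"matched estimate: {matched_estimate:.2f}")
```

With these made-up numbers the naive comparison overstates the effect, while the comparison against a control group with the same population mix recovers it.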

The ideal solution for this situation would be direct matching, which enforces balance in the strongest possible way: each treated subject has an exact counterpart who didn’t receive a treatment. (In our example this would mean matching each treated subject to a control subject from the appropriate population.) Then comparing differential outcomes between matched pairs gives an estimate of the average treatment effect, which is independent of the covariates. However, direct matching is often not possible, so we have to turn to other, albeit less effective, ways to reduce error in estimating the average treatment effect.

Suppose we could identify a feature in the covariates which would make it more or less likely that a subject came from the treatment group or the control group. This feature would represent a kind of imbalance between the treatment and control group. The propensity score quantifies this kind of imbalance.

In a fully randomized study, the propensity score of any subject should be roughly 50%—that is, there is an equally likely chance that each subject would have been assigned to either group given their baseline measurements. However, if we compare a treatment group with a control group consisting of historical data, the propensity score of many subjects might be significantly greater or less than 50%.

To illustrate how we estimate propensity scores, we’ll use an example and show two plots. In our example, we have a treatment population of 100 subjects for which two baseline covariates have been measured (figure 2A). We have a control population of 1000 subjects that doesn’t perfectly overlap the treated distribution. We fit a logistic regression model that assigns a probability to each subject of falling into the treatment group (figure 2A, B). This model then provides estimates of the propensity score for each subject.
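
The estimation step can be sketched in a few lines of code. This is a minimal illustration rather than the code behind figure 2: it uses only the Python standard library, fits the logistic regression by plain gradient ascent, and assumes (our choice) that the control population is a unit-variance Gaussian shifted to (1, 1). The function names are ours. In practice one would normally use a statistics library for the fit.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_propensity_model(X, t, lr=0.5, steps=500):
    """Fit P(T=1 | x) = sigmoid(b + w.x) by gradient ascent on the mean log-likelihood."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, ti in zip(X, t):
            err = ti - sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        b += lr * grad_b / n
        w = [wj + lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b


def propensity_scores(X, w, b):
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) for x in X]


rng = random.Random(0)
treated = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
control = [(rng.gauss(1, 1), rng.gauss(1, 1)) for _ in range(1000)]  # assumed shift

X = treated + control
t = [1] * len(treated) + [0] * len(control)

w, b = fit_propensity_model(X, t)
ps = propensity_scores(X, w, b)
ps_treated, ps_control = ps[: len(treated)], ps[len(treated):]

print("mean propensity score, treated:", sum(ps_treated) / len(ps_treated))
print("mean propensity score, control:", sum(ps_control) / len(ps_control))
```

Because the control group is ten times larger than the treatment group, the estimated scores sit well below 50% on average. What matters for matching is how the scores of the two groups overlap, not their absolute level.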

We can create a control population by matching treated subjects to control subjects according to propensity scores. First we take the potentially imbalanced treatment and control groups and estimate the propensity score with a statistical model—usually logistic regression (figure 2A, B). Then we can create a new control group by matching historical control subjects to treated subjects on the basis of their estimated propensity scores. Two subjects are matched if they have similar propensity scores. If the matching is successful, each treated subject should have a match in the control group that has roughly the same propensity score (figure 2C). It’s also important to note (figure 2C) that even when two subjects match on propensity score, they might not be that close together in terms of their baseline covariates.
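
As a rough sketch of the matching step, the snippet below pairs each treated subject with the unused control subject whose propensity score is closest, and skips treated subjects with no control within a tolerance (a "caliper"). The post does not specify a matching algorithm, so greedy 1:1 nearest-neighbor matching is our assumption, as are the function name, the caliper value, and the toy scores.

```python
def match_on_propensity(ps_treated, ps_control, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on the propensity score, without replacement.

    Returns {treated_index: control_index}. Treated subjects with no unused
    control within `caliper` are left unmatched.
    """
    available = set(range(len(ps_control)))
    matches = {}
    for i, p in enumerate(ps_treated):
        if not available:
            break
        j = min(available, key=lambda k: abs(ps_control[k] - p))
        if abs(ps_control[j] - p) <= caliper:
            matches[i] = j
            available.remove(j)
    return matches


# Toy propensity scores, for illustration only.
ps_treated = [0.30, 0.55, 0.90]
ps_control = [0.10, 0.28, 0.33, 0.52, 0.60]

matches = match_on_propensity(ps_treated, ps_control)
print(matches)  # {0: 1, 1: 3}
```

The third treated subject (score 0.90) has no control within the caliper and is dropped. Note that the function never looks at the covariates themselves, which is why matched subjects can still be far apart in covariate space.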

If we look at this restricted group of matched patients, the likelihood of any subject being in the treatment or control group is roughly 50%, ensuring the kind of balance found in randomized trials. Put another way, restricted to the matched populations, the propensity score model cannot accurately divide the two groups. In the original paper on the topic [1], Rosenbaum and Rubin showed that as long as all variables that influence treatment/control group selection are included among the measured covariates, matching on the propensity score is sufficient to estimate the average treatment effect without bias. But as we have mentioned and will go into shortly, the variance of the estimated average treatment effect might still be large.

Although PSM is a commonly-used alternative to direct matching, some experts suggest that it isn’t a very good means of selecting a control group [2]. So what is the downside?

We have mentioned before that finding an exact match for treated subjects is the most desirable way to compare the treatment and control groups. The reason for this has to do with the *variance* of the average treatment effect estimated by direct matching. Imagine an idealized scenario in which the distributions of control and treatment subjects are known; we draw samples from these two distributions in order to fill out a treatment and control group, and then compute the estimated average treatment effect. If we repeat that experiment many times, the estimated average treatment effect will form some distribution around the true average treatment effect. The variance of the estimator is a measure of how much that distribution is spread out around the true value (figure 3).

In such iterated testing, both PSM and direct matching yield estimators that are *unbiased*: the mean estimated treatment effect over all repetitions approaches the true effect as the number of subjects gets large. However, the PSM estimator can be much more variable. This variability can be a huge downside because the more variance the average treatment effect estimate has, the less power the trial has to detect a given effect (figure 4). Put another way, the more noise in the channel, the less likely we are to be able to detect a signal (see appendix).
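
The link between variance and power can be made concrete with the textbook power formula for a two-sided z-test. The sketch below assumes the treatment effect estimate is approximately normal with a known standard error. The function name and the example numbers are ours, not from the analysis behind figure 4.

```python
from statistics import NormalDist


def power_two_sided_z(effect, std_error, alpha=0.05):
    """Probability that a two-sided z-test at level `alpha` rejects 'no effect'."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect / std_error
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Same true effect, increasingly noisy estimators (illustrative values).
for se in (0.25, 0.35, 0.50):
    print(f"standard error {se:.2f} -> power {power_two_sided_z(1.0, se):.3f}")
```

Holding the true effect fixed, doubling the standard error of the estimator in this example takes the trial from nearly certain detection to roughly a coin flip.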

*Here’s the biggest takeaway from the post: in the best case scenario PSM is able to reduce complexity, going from matching on many variables to one. However, in doing so, it generates a new problem: the variance of the treatment effect estimate becomes larger. As a result, the estimate of the treatment effect is less precise, and the trial requires more subjects in order to observe the same effect.* Thus, in our view (and the views of a number of experts in clinical statistics and econometrics), **propensity score matching can’t sufficiently overcome the complexity problem presented by direct matching** [2, 3, 4].

In the next blog post we will describe a superior means of surmounting direct matching’s complexity problem: a technology that allows us to generate synthetic subject records matched exactly to the baseline covariates of any treated subject. Such a match is called a *digital twin*. A digital twin shows how a treated subject might have changed over time had they not received a treatment. Supplementing or completely replacing a control group with digital twins provides a control population that is perfectly balanced to the treatment group, yielding a higher-powered estimate of the average treatment effect.

*Written in collaboration with Joy Chiew*

1. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983 Apr;70(1):41-55. https://doi.org/10.1093/biomet/70.1.41

2. King G, Nielsen R. Why propensity scores should not be used for matching. Political Analysis. 2016 Feb 2:1-20.

3. Li S, Vlassis N, Kawale J, Fu Y. Matching via Dimensionality Reduction for Estimation of Treatment Effects in Digital Marketing Campaigns. In IJCAI 2016 Jul 9 (pp. 3768-3774).

4. Aikens RC, Greaves D, Baiocchi M. Using the Prognostic Score to Reduce Heterogeneity in Observational Studies. arXiv preprint arXiv:1908.09077. 2019 Aug 24.

In this appendix, we use a simple simulation to demonstrate explicitly how the added variance of PSM reduces power. Suppose we want to analyze the effect of a clinical intervention on subjects coming from potentially imbalanced treatment and control populations. We assume that we care about just two continuous covariates (X_1, X_2) and their effects on a continuous outcome variable Y. We assume the true relation between the clinical covariates and the treatment and Y is given by a linear equation,

in which T is an indicator of whether the patient (X_1, X_2) received the treatment or not. Therefore by construction the true treatment effect is 1.

We consider a scenario in which a treatment cohort of 100 subjects is selected from an isotropic Gaussian distribution centered at (0,0) with standard deviation 1. A control cohort is selected from an identical Gaussian distribution, but we let its mean shift from (0,0) to (2,2) in some number of steps. This incrementally simulates increasing imbalance between the treatment and control cohorts. Because historical data can potentially be more plentiful, we draw 1000 subjects for the control group in each simulation.

In this setting we consider the performance of PSM vs. direct matching for estimating the average treatment effect. Crucially we will compute the variance of each estimator over 1000 random trials. In these examples we perform 2:1 control-to-treatment matching, and perform trimming in order to remove treated subjects who have no good matches in the control group (supplementary fig. 1).

By running these experiments 1000 times we can compute the variance of the treatment effect estimated by each method and compare them. We find that the direct matching estimate consistently has lower variance than the PSM estimate by a wide margin (supplementary fig. 2). The end result is that a trial analyzed with PSM has significantly less power than the same trial analyzed with direct matching (supplementary fig. 3).
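
For readers who want to experiment, here is a scaled-down sketch in the spirit of this simulation. It is not the code behind the supplementary figures, and it makes several simplifying assumptions of its own: the coefficients on X_1 and X_2 in the outcome model are hypothetical (only the treatment effect of 1 comes from the text), the control mean is fixed at (0.5, 0.5), the propensity score is computed exactly from the two known Gaussians rather than estimated, matching is 1:1 with replacement and without trimming, and nearest-neighbor matching in covariate space stands in for direct matching because exact matches do not exist for continuous covariates. Cohort sizes and the number of repetitions are reduced so that it runs quickly.

```python
import math
import random
import statistics


def nearest(items, target, dist):
    """Index of the element of `items` closest to `target` under `dist`."""
    return min(range(len(items)), key=lambda j: dist(items[j], target))


def simulate_once(rng, shift=0.5, n_treated=50, n_control=500, noise_sd=0.1):
    def outcome(x, t):
        # ASSUMED outcome model: the covariate coefficients are hypothetical.
        # The coefficient on t is the true treatment effect, 1.
        return 2.0 * x[0] + 1.0 * x[1] + 1.0 * t + rng.gauss(0, noise_sd)

    def true_ps(x):
        # Exact propensity score for unit-variance Gaussians centered at
        # (0, 0) for treated and (shift, shift) for control.
        logit = -shift * (x[0] + x[1]) + shift ** 2 + math.log(n_treated / n_control)
        return 1.0 / (1.0 + math.exp(-logit))

    treated = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_treated)]
    control = [(rng.gauss(shift, 1), rng.gauss(shift, 1)) for _ in range(n_control)]
    y_treated = [outcome(x, 1) for x in treated]
    y_control = [outcome(x, 0) for x in control]
    ps_control = [true_ps(x) for x in control]

    est_psm, est_cov = 0.0, 0.0
    for x, y in zip(treated, y_treated):
        # Match on the single propensity score...
        j_ps = nearest(ps_control, true_ps(x), lambda a, b: abs(a - b))
        # ...versus matching on both covariates (squared Euclidean distance).
        j_cov = nearest(control, x, lambda a, b: (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)
        est_psm += (y - y_control[j_ps]) / n_treated
        est_cov += (y - y_control[j_cov]) / n_treated
    return est_psm, est_cov


rng = random.Random(0)
results = [simulate_once(rng) for _ in range(60)]
psm_estimates = [r[0] for r in results]
cov_estimates = [r[1] for r in results]

var_psm = statistics.variance(psm_estimates)
var_cov = statistics.variance(cov_estimates)
print(f"PSM:                mean {statistics.mean(psm_estimates):.3f}, variance {var_psm:.4f}")
print(f"covariate matching: mean {statistics.mean(cov_estimates):.3f}, variance {var_cov:.4f}")
```

Under these assumptions the propensity score depends only on X_1 + X_2, so two subjects with the same score can differ widely in X_1 - X_2, and any part of the outcome that depends on that difference shows up as extra noise in the PSM estimate. How large the gap is depends on the assumed coefficients.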