Yumi's Blog

Gentle Hands-on Introduction to Propensity Score Matching

Introduction

Data scientists help drive business decisions by providing insights from data. To answer questions like "does a new feature improve user engagement?", a data scientist may conduct an A/B test to see whether the new feature has a "causal" effect on user engagement, evaluated with some metric. Before diving into causal inference for observational studies, let's talk about the more common approach, A/B testing, and its limitations.

A/B testing

In my opinion, A/B testing is the IT industry's rebranding of traditional experimental design for finding statistically significant causal effects. In clinical trials, experimental design is used to assess whether a new drug yields a significant improvement over the status quo. Likewise, an A/B test takes a random subset of the target population and randomly assigns users to a treatment (new feature) group or a control group to see whether the new feature improves the user experience. Here, the random assignment of subjects to the treatment and control groups plays the key role in making a causal statement: randomization ensures that, on average, there are no confounding factors.

Let's think in more detail about this "random assignment". Suppose we do not randomly assign subjects to the treatment or placebo group, and instead let subjects choose of their own will whether to take the treatment. Suppose also that a t-test finds that the mean health status of the treatment group is significantly higher than that of those who did not take the treatment. Can we conclude that the treatment is significantly associated with better health status? Yes. But is it a causal effect?

The answer is no, because of confounding factors. For example, those who chose to take the treatment may happen to be more health-conscious, or may have been healthier to begin with and thus better able to tolerate the pain of the treatment. These confounding factors, which are highly correlated with membership in the treatment group, may be the true root cause of the better health status.

To average out these confounding effects, an A/B test randomizes subjects. By randomly allocating individuals to the treatment and placebo groups, we can hope that both groups have the same level of health consciousness, and the same initial health status, "on average".

Observational study

We have seen that A/B testing is useful for finding statistically significant causal effects. But there are many scenarios where we cannot run an A/B test. Some features cannot be A/B tested because of engineering constraints, and others because the company has no control over the feature. For example, if LinkedIn wants to test the effect of profile pictures on hiring decisions, it cannot randomly force a subset of individuals to have no profile picture! In such scenarios, how can we eliminate confounding factors and perform a causal analysis?

In this blog, we learn:

  • Propensity Score Matching (PSM), one of the most common techniques for causal inference in observational studies.

  • How to analyze data with PSM in R.

Propensity Score Matching

PSM attempts to average out the confounding factors by making the treated and untreated groups comparable with respect to the potential confounding variables. The propensity score is the probability that a subject is in the treatment group, and it is calculated from the potential confounding variables. If the distribution of the propensity scores is similar between the treatment and placebo groups, we can say that the confounding factors have been averaged out.
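In symbols, for a subject with confounder vector X, the propensity score is the conditional probability of treatment:

```latex
e(X) = \Pr(Z = 1 \mid X)
```

Rosenbaum and Rubin showed that conditioning on e(X) balances the distribution of X between the two groups, so matching on this single scalar stands in for matching on all the confounders at once.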

PSM tries to make the distribution of the propensity score the same between the treatment and non-treatment groups by matching each subject in the treatment group with a subject in the non-treatment group that has a similar propensity score. It may happen that the same subject in the treatment group is matched with multiple subjects in the placebo group (upsampling), or that some subjects are never matched and are hence discarded from the analysis (downsampling).

PSM procedure

Here is the general PSM procedure.

Step 1. Run logistic regression:

  • Dependent variable: Z = 1, if a subject is in treatment group; Z = 0, otherwise.
  • Independent variables: the confounding factors. The fitted probability from this logistic regression is the propensity score.

Step 2. Match each participant to one or more nonparticipants on the propensity score, using nearest-neighbor matching, exact matching, or another matching technique.

Step 3. Verify that covariates are balanced across treatment and comparison groups in the matched or weighted sample.
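On simulated data, the three steps can be sketched end-to-end (a minimal illustration with made-up confounders x1, x2 and a hand-rolled greedy nearest-neighbor loop; this is not MatchIt's implementation):

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)   # confounder 1
x2 <- rnorm(n)   # confounder 2
# treatment assignment depends on the confounders
z  <- rbinom(n, 1, plogis(-1 + 0.8 * x1 - 0.5 * x2))

# Step 1: logistic regression of treatment on confounders;
# the fitted probabilities are the propensity scores.
ps <- fitted(glm(z ~ x1 + x2, family = binomial))

# Step 2: greedy 1:1 nearest-neighbor matching on the propensity score,
# without replacement (each control is used at most once).
treated <- which(z == 1)
pool    <- which(z == 0)
matched <- integer(0)
for (i in treated) {
  j       <- pool[which.min(abs(ps[pool] - ps[i]))]
  matched <- c(matched, j)
  pool    <- setdiff(pool, j)
}

# Step 3: check balance -- the mean propensity score of the matched
# controls should be much closer to the treated mean than before matching.
c(treated          = mean(ps[treated]),
  all_controls     = mean(ps[z == 0]),
  matched_controls = mean(ps[matched]))
```

Real matching software handles ties, calipers, and matching with replacement; the loop above only conveys the idea of Step 2.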

Hands-on PSM analysis

Now, we are ready to analyze observational study data. We will closely follow the seminal tutorial R Tutorial 8: Propensity Score Matching.

This tutorial analyzes the effect of going to Catholic school, as opposed to public school, on student achievement. Because students who attend Catholic school on average are different from students who attend public school, we will use propensity score matching to get more credible causal estimates of Catholic schooling.

First, follow the link above to download the data ecls.csv.

In this data,

  • independent variable: catholic (1 = student went to Catholic school; 0 = student went to public school)
  • dependent variable: c5r2mtsc_std, the student's standardized math score
In [1]:
library(MatchIt)
library(dplyr)
library(ggplot2)

ecls <- read.csv("ecls/data-processed/ecls.csv")
cat("dim:",dim(ecls))
head(ecls)
Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


dim: 11078 22
A data.frame: 6 × 22 (selected columns shown; the remaining columns are race, race_black, race_hispanic, race_asian, p5hdage, w3daded_hsb, w3momscr, w3dadscr, w3inccat, w3povrty, p5fstamp, c5r2mtsc)

   childid  catholic race_white p5numpla p5hmage w3momed_hsb w3income c5r2mtsc_std
1 0001002C         0          1        1      47           0  62500.5    0.9817533
2 0001004C         0          1        1      41           0  45000.5    0.5943775
3 0001005C         0          1       NA      NA          NA       NA    0.3381515
4 0001010C         0          1        1      43           0  62500.5    0.4906106
5 0001011C         1          1        1      38           0  87500.5    1.4512779
6 0001012C         0          1        1      47           0 150000.5    2.5956991

Let's look at the distribution of the standardized math score for Catholic- versus public-school students. The boxplot shows that the medians of the two groups look quite different.

In [2]:
boxplot(c5r2mtsc_std ~ catholic, data = ecls,
        xlab="catholic",
        ylab="standardized math score")

Ignoring confounding factors for now, let's see whether there is a significant difference between the Catholic and public schools in the mean standardized math score.

I use Welch's two-sample t-test, the most general form of the t-test, which assumes neither equal sample sizes nor equal variances across the two groups.
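To make the "no equal-variance assumption" concrete, the Welch statistic can be reproduced by hand from the group summaries (a sketch on made-up data; in R, t.test uses var.equal = FALSE, i.e. Welch, by default):

```r
set.seed(1)
g0 <- rnorm(60, mean = 0.0, sd = 1)  # "control" scores
g1 <- rnorm(40, mean = 0.4, sd = 2)  # "treatment" scores: different n and variance

# Welch standard error: per-group variances, no pooling
se <- sqrt(var(g0) / length(g0) + var(g1) / length(g1))
t_manual <- (mean(g0) - mean(g1)) / se

# matches the statistic reported by t.test (Welch is the default)
t_builtin <- unname(t.test(g0, g1)$statistic)
all.equal(t_manual, t_builtin)  # TRUE
```

The ordinary Student t-test would instead pool the two variances, which is exactly what we cannot justify here.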

In [3]:
t.test(c5r2mtsc_std ~ catholic,data=ecls)
	Welch Two Sample t-test

data:  c5r2mtsc_std by catholic
t = -9.1069, df = 2214.5, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2727988 -0.1761292
sample estimates:
mean in group 0 mean in group 1 
    -0.03059583      0.19386817 

The t-test shows that the mean standardized test scores are significantly different. But did going to Catholic school cause the better performance on the standardized test? Let's consider confounding factors!

Following R Tutorial 8: Propensity Score Matching, I will use the following variables as potential confounding factors to model propensity score.

  • race_white: Is the student white (1) or not (0)?
  • p5hmage: Mother’s age
  • w3income: Family income
  • p5numpla: Number of places the student has lived for at least 4 months
  • w3momed_hsb: Is the mother’s education level high-school or below (1) or some college or more (0)?
In [4]:
ecls_cov <- c('race_white', 'p5hmage', 'w3income', 'p5numpla', 'w3momed_hsb')
ecls %>%
  group_by(catholic) %>%
  select(one_of(ecls_cov)) %>%
  summarise_all(list(~ mean(., na.rm = TRUE)))
Adding missing grouping variables: `catholic`

A tibble: 2 × 6
  catholic race_white  p5hmage w3income p5numpla w3momed_hsb
         0  0.5561246 37.56097 54889.16 1.132669   0.4640918
         1  0.7251656 39.57516 82074.30 1.092701   0.2272069

The means of all these factors turn out to be significantly different between the Catholic and public schools, as the following t-tests show.

In [5]:
print(ecls_cov)
lapply(ecls_cov,
       function(x) {
           t.test(ecls[,x] ~ ecls$catholic)
           })
[1] "race_white"  "p5hmage"     "w3income"    "p5numpla"    "w3momed_hsb"
[[1]]

	Welch Two Sample t-test

data:  ecls[, x] by ecls$catholic
t = -13.453, df = 2143.3, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1936817 -0.1444003
sample estimates:
mean in group 0 mean in group 1 
      0.5561246       0.7251656 


[[2]]

	Welch Two Sample t-test

data:  ecls[, x] by ecls$catholic
t = -12.665, df = 2186.9, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.326071 -1.702317
sample estimates:
mean in group 0 mean in group 1 
       37.56097        39.57516 


[[3]]

	Welch Two Sample t-test

data:  ecls[, x] by ecls$catholic
t = -20.25, df = 1825.1, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -29818.10 -24552.18
sample estimates:
mean in group 0 mean in group 1 
       54889.16        82074.30 


[[4]]

	Welch Two Sample t-test

data:  ecls[, x] by ecls$catholic
t = 4.2458, df = 2233.7, p-value = 2.267e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.02150833 0.05842896
sample estimates:
mean in group 0 mean in group 1 
       1.132669        1.092701 


[[5]]

	Welch Two Sample t-test

data:  ecls[, x] by ecls$catholic
t = 18.855, df = 2107.3, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.2122471 0.2615226
sample estimates:
mean in group 0 mean in group 1 
      0.4640918       0.2272069 

Propensity score estimation

To calculate the propensity scores and find the matches, we will use the matchit function from the MatchIt package. The propensity score is calculated by fitting a logistic regression with the potential confounding factors as independent variables and the school type as the dependent variable.

The logistic regression fit can also be done directly with glm using family = "binomial".
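As a quick illustration of that glm route (sketched on R's built-in mtcars data rather than ecls, treating the transmission flag am as a stand-in "treatment" and mpg as a stand-in confounder):

```r
# Fit P(am = 1 | mpg) by logistic regression
fit <- glm(am ~ mpg, family = binomial, data = mtcars)

# The fitted values are the estimated propensity scores:
# one probability in (0, 1) per observation.
ps <- fitted(fit)
summary(ps)
```

matchit fits essentially this model for us on the ecls covariates and then performs the matching in one call.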

To find the match, I will use nearest neighbor matching.

In [16]:
# matchit does not allow missing values in data
variable_in_use <- c(ecls_cov,"catholic","c5r2mtsc_std")
omitTF <- apply(ecls[,variable_in_use],1,
               function(x)any(is.na(x)))
ecls_nomiss <- ecls[!omitTF,variable_in_use]

mod_match <- matchit(catholic~ race_white + p5hmage + w3income + p5numpla + w3momed_hsb,
                     data=ecls_nomiss,method = "nearest")

Verify that covariates are balanced across treatment and comparison groups in the matched or weighted sample.

The summary shows that, after matching, the mean of each confounding variable differs by less than one standard deviation between the Catholic and public schools.

"Summary of balance for all data" simply shows the means for the Catholic and public schools; we already studied these with the Welch t-tests above. What is interesting is that the Mean Diff of each confounding variable shrinks substantially once the data are matched; see "Summary of balance for matched data". It seems that the matching worked well.
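A common numeric balance check is the standardized mean difference (SMD) of each covariate; values below roughly 0.1 are usually taken to indicate good balance. A sketch with made-up vectors (not the ecls columns):

```r
# Standardized mean difference of covariate x between two groups g (0/1)
smd <- function(x, g) {
  m <- tapply(x, g, mean)   # per-group means
  v <- tapply(x, g, var)    # per-group variances
  unname((m[2] - m[1]) / sqrt((v[1] + v[2]) / 2))
}

set.seed(1)
g <- rep(0:1, each = 100)
x_unbalanced <- rnorm(200, mean = g)  # shifted by group: large SMD
x_balanced   <- rnorm(200)            # no shift: small SMD
c(smd(x_unbalanced, g), smd(x_balanced, g))
```

Unlike a t-test p-value, the SMD does not shrink automatically as the matched sample gets smaller, which is why it is the preferred balance diagnostic.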

In [17]:
summary(mod_match)
Call:
matchit(formula = catholic ~ race_white + p5hmage + w3income + 
    p5numpla + w3momed_hsb, data = ecls_nomiss, method = "nearest")

Summary of balance for all data:
            Means Treated Means Control SD Control  Mean Diff  eQQ Med
distance           0.1927        0.1379     0.0845     0.0549 5.67e-02
race_white         0.7411        0.5914     0.4916     0.1497 0.00e+00
p5hmage           39.5932       37.5658     6.5506     2.0274 2.00e+00
w3income       82568.9357    55485.0210 43961.0872 27083.9146 2.50e+04
p5numpla           1.0917        1.1298     0.3910    -0.0380 0.00e+00
w3momed_hsb        0.2234        0.4609     0.4985    -0.2375 0.00e+00
              eQQ Mean  eQQ Max
distance        0.0548 7.60e-02
race_white      0.1501 1.00e+00
p5hmage         2.2544 7.00e+00
w3income    27069.1775 6.25e+04
p5numpla        0.0399 2.00e+00
w3momed_hsb     0.2374 1.00e+00


Summary of balance for matched data:
            Means Treated Means Control SD Control Mean Diff eQQ Med  eQQ Mean
distance           0.1927        0.1927     0.0846    0.0000       0    0.0000
race_white         0.7411        0.7470     0.4349   -0.0059       0    0.0059
p5hmage           39.5932       39.5503     5.2243    0.0429       0    0.0873
w3income       82568.9357    81403.9926 46618.2406 1164.9430       0 1164.9430
p5numpla           1.0917        1.0762     0.2970    0.0155       0    0.0200
w3momed_hsb        0.2234        0.2152     0.4111    0.0081       0    0.0081
             eQQ Max
distance    3.30e-03
race_white  1.00e+00
p5hmage     1.00e+01
w3income    6.25e+04
p5numpla    2.00e+00
w3momed_hsb 1.00e+00

Percent Balance Improvement:
            Mean Diff. eQQ Med eQQ Mean  eQQ Max
distance       99.9934     100  99.9689  95.6780
race_white     96.0477       0  96.0591   0.0000
p5hmage        97.8841     100  96.1286 -42.8571
w3income       95.6988     100  95.6964   0.0000
p5numpla       59.1653       0  50.0000   0.0000
w3momed_hsb    96.5746       0  96.5732   0.0000

Sample sizes:
          Control Treated
All          7915    1352
Matched      1352    1352
Unmatched    6563       0
Discarded       0       0

The plot function for matchit objects draws empirical quantile-quantile plots of each covariate to check the balance of the marginal distributions when type = "QQ". Notice that the distributions of the covariates for Catholic (y-axis) and public (x-axis) schools become more comparable after the data are matched.

With type = "hist", plot outputs histograms of the propensity score in the original and matched treated and control groups. After matching, the propensity score distributions of the Catholic and public schools become more comparable.

These observations indicate that the matched samples averaged out the potential confounding factors.

In [19]:
plot(mod_match,type="QQ")
plot(mod_match,type="hist")

Back to Welch's T-test

Now let's perform Welch's t-test again on the matched sample. Do we still get a significant p-value? Yes, we do. However, the sign flips: after controlling for confounding, going to Catholic school is associated with lower scores on the standardized test. This analysis indicates that going to Catholic school is not the root cause of the better performance on the standardized test, and it could actually hurt that performance!

In [20]:
dta_m <- match.data(mod_match) ## obtain the dataframe of matched samples.
head(dta_m) 
with(dta_m, t.test(c5r2mtsc_std ~ catholic))
A data.frame: 6 × 9

   race_white p5hmage w3income p5numpla w3momed_hsb catholic c5r2mtsc_std  distance weights
2           1      41  45000.5        1           0        0    0.5943775 0.1801360       1
4           1      43  62500.5        1           0        0    0.4906106 0.2092957       1
5           1      38  87500.5        1           0        1    1.4512779 0.2154022       1
8           1      41  62500.5        1           0        0    0.3851966 0.1997897       1
29          0      42  12500.5        1           0        0   -0.9721114 0.1152619       1
30          1      39  87500.5        1           0        0    0.4012558 0.2203810       1
	Welch Two Sample t-test

data:  c5r2mtsc_std by catholic
t = 4.9879, df = 2685.1, p-value = 6.494e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1051808 0.2414485
sample estimates:
mean in group 0 mean in group 1 
      0.3829825       0.2096679 

Comments