Introduction¶
Data scientists help make business decisions by providing insights from data. To answer questions like "does a new feature improve user engagement?", data scientists may conduct A/B testing to see whether the new feature has any "causal" effect on user engagement, as measured by a chosen metric. Before diving into causal inference in observational studies, let's talk about the more common approach, A/B testing, and its limitations.
A/B testing¶
In my opinion, A/B testing is a rebranded version of traditional experimental design, used in the IT industry to find statistically significant causal effects. In clinical trials, experimental design is used to assess whether a new drug brings a significant improvement over the current status quo. Similarly, A/B testing takes a random subset of the target population and randomly assigns users to treatment (new feature) and control groups to see whether the new feature improves user experience. Here, the random assignment of subjects to treatment and control groups plays the key role in making a causal statement. Random assignment ensures that, on average, confounding factors are balanced between the two groups.
Let's think about this "random assignment" in more detail. Suppose we do not randomly assign subjects to treatment or placebo groups, and instead let the subjects choose for themselves whether to take the treatment. Suppose also that a t-test finds that the mean health status of the treatment group is significantly higher than that of those who did not take the treatment. Can we conclude that there is a statistically significant difference in health status between the two groups? Yes. But is it a causal effect of the treatment?
The answer is no, because of confounding factors. For example, those who chose to take the treatment may happen to be more health-conscious, or may have been healthy enough from the beginning to tolerate the pain of the treatment. These confounding factors, which are highly correlated with membership in the treatment group, may be the true root cause of the better health status.
To average out these confounding effects, A/B testing needs to randomize subjects. By randomly allocating individuals to treatment and placebo groups, we expect both groups to have the same level of health consciousness and the same initial health status "on average".
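The difference between self-selection and randomization can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the variable names, the coefficients, and the choice of a treatment with zero true effect are all hypothetical and are not taken from any real study.

```r
set.seed(42)
n <- 10000

# Hypothetical confounder: health consciousness
health_conscious <- rnorm(n)

# Assumption: the treatment has NO true effect; the outcome depends only on the confounder
outcome <- 2 * health_conscious + rnorm(n)

# (a) Self-selection: health-conscious subjects are more likely to opt in
self_selected <- rbinom(n, 1, plogis(1.5 * health_conscious))

# (b) Randomization: a coin flip decides, independent of the confounder
randomized <- rbinom(n, 1, 0.5)

# Difference in mean outcome between treated (z == 1) and untreated (z == 0)
diff_in_means <- function(y, z) mean(y[z == 1]) - mean(y[z == 0])

naive_diff      <- diff_in_means(outcome, self_selected)
randomized_diff <- diff_in_means(outcome, randomized)

cat("Self-selected difference:", round(naive_diff, 2), "\n")
cat("Randomized difference:   ", round(randomized_diff, 2), "\n")
```

Under self-selection the difference in means is far from zero even though the treatment does nothing, while under randomization it should be close to zero.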
Observational study¶
We have seen that A/B testing is useful for finding statistically significant causal effects. But there are many scenarios where we cannot do A/B testing. For example, some features cannot be A/B tested because of engineering constraints, and others cannot be A/B tested because the company has no control over the feature. For example, if LinkedIn wants to test the effect of a profile picture on hiring decisions, LinkedIn cannot randomly force a subset of individuals to have no profile picture! In such scenarios, how can we eliminate confounding factors and carry out a causal analysis?
In this blog post, we learn:
- Propensity Score Matching (PSM), one of the most common techniques for causal inference in observational studies.
- How to analyze data with PSM in R.
Propensity Score Matching¶
PSM attempts to average out the confounding factors by making the treated and untreated groups comparable with respect to the potential confounding variables. The propensity score is the probability that a subject is in the treatment group, estimated from the potential confounding variables. If the distributions of the propensity scores are similar between the treatment and placebo groups, we can say that the confounding factors are averaged out.
PSM tries to make the distribution of the propensity score the same in the treatment and non-treatment groups by matching each subject in the treatment group with a subject in the non-treatment group that has a similar propensity score. A subject in the treatment group may be matched with multiple subjects in the placebo group (upsampling), and some subjects may not be matched at all and are therefore discarded from the analysis (downsampling).
PSM procedure¶
Here is the general PSM procedure.
Step 1. Run logistic regression:
- Dependent variable: Z = 1, if a subject is in treatment group; Z = 0, otherwise.
- Independent variables: the potential confounding factors. The predicted probability from this logistic regression is the propensity score.
Step 2. Match each participant to one or more nonparticipants on propensity score by nearest neighbor matching, exact matching or other matching techniques.
Step 3. Verify that covariates are balanced across treatment and comparison groups in the matched or weighted sample.
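To make the three steps concrete, here is a minimal base-R sketch on simulated data. Everything in it is an assumption made for illustration: the toy data frame, the single confounder x, and the greedy 1:1 matching without replacement. It is not how any particular package implements matching.

```r
set.seed(1)
n <- 2000

# Hypothetical confounder x drives both treatment uptake and the outcome
x <- rnorm(n)
z <- rbinom(n, 1, plogis(-1 + x))      # treatment indicator Z
y <- 1 + 2 * x + 0.5 * z + rnorm(n)    # outcome; treatment effect set to 0.5 by construction
toy <- data.frame(x, z, y)

# Step 1: logistic regression of Z on the confounder;
# the fitted probabilities are the propensity scores
ps_fit <- glm(z ~ x, family = binomial(), data = toy)
toy$pscore <- fitted(ps_fit)

# Step 2: greedy 1:1 nearest-neighbor matching on the propensity score,
# without replacement
treated  <- which(toy$z == 1)
controls <- which(toy$z == 0)
matched_control <- integer(length(treated))
for (i in seq_along(treated)) {
  gaps <- abs(toy$pscore[controls] - toy$pscore[treated[i]])
  best <- which.min(gaps)
  matched_control[i] <- controls[best]
  controls <- controls[-best]          # each control is used at most once
}
matched <- toy[c(treated, matched_control), ]

# Step 3: compare the gap in the confounder's mean before and after matching
mean_gap <- function(d) mean(d$x[d$z == 1]) - mean(d$x[d$z == 0])
gap_before <- mean_gap(toy)
gap_after  <- mean_gap(matched)
cat("Gap in x before matching:", round(gap_before, 3), "\n")
cat("Gap in x after matching: ", round(gap_after, 3), "\n")
```

The gap in x should shrink after matching, which is the balance that Step 3 asks us to verify. Greedy matching depends on the order in which treated subjects are processed, so dedicated packages offer more careful variants.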
Hands-on PSM analysis¶
Now, we are ready to analyze observational study data. We will closely follow the seminal tutorial R Tutorial 8: Propensity Score Matching.
This tutorial analyzes the effect of going to Catholic school, as opposed to public school, on student achievement. Because students who attend Catholic school on average are different from students who attend public school, we will use propensity score matching to get more credible causal estimates of Catholic schooling.
First, follow the link above to download the data ecls.csv.
In this data,
- independent variable: catholic (1 = student went to Catholic school; 0 = student went to public school)
- dependent variable: c5r2mtsc_std, the student's standardized math score
library(MatchIt)
library(dplyr)
library(ggplot2)
ecls <- read.csv("ecls/data-processed/ecls.csv")
cat("dim:",dim(ecls))
head(ecls)
Let's look at the distribution of the standardized math score for Catholic vs. public school students. The boxplot shows that the medians of the two groups seem quite different.
boxplot(c5r2mtsc_std ~ catholic, data = ecls,
xlab="catholic",
ylab="standardized math score")
Ignoring confounding factors, let's see if there is any significant difference between the Catholic and public schools in terms of the mean standardized math score.
I use Welch's two-sample t-test, which, unlike Student's t-test, does not assume that the two groups have the same variance. It is also what t.test in R runs by default.
t.test(c5r2mtsc_std ~ catholic,data=ecls)
The t-test shows that the means of the standardized test scores are significantly different. But did going to Catholic school cause the better performance on the standardized test? Let's consider confounding factors!
Following R Tutorial 8: Propensity Score Matching, I will use the following variables as potential confounding factors to model propensity score.
- race_white: Is the student white (1) or not (0)?
- p5hmage: Mother’s age
- w3income: Family income
- p5numpla: Number of places the student has lived for at least 4 months
- w3momed_hsb: Is the mother’s education level high-school or below (1) or some college or more (0)?
ecls_cov <- c('race_white', 'p5hmage', 'w3income', 'p5numpla', 'w3momed_hsb')
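# Mean of each covariate by school type (across() requires dplyr >= 1.0;
# it replaces the deprecated funs() / summarise_all() idiom)
ecls %>%
group_by(catholic) %>%
summarise(across(all_of(ecls_cov), ~ mean(.x, na.rm = TRUE)))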
The t-tests below show that the means of all these factors are significantly different between the Catholic and public schools.
print(ecls_cov)
lapply(ecls_cov,
function(x) {
t.test(ecls[,x] ~ ecls$catholic)
})
Propensity score estimation¶
To calculate the propensity score and find the matches, we will use the matchit function.
The propensity score is calculated by fitting a logistic regression with the potential confounding factors as independent variables and the school type as the dependent variable.
The logistic regression fit can also be done using glm with family="binomial".
To find the matches, I will use nearest neighbor matching.
# matchit does not allow missing values in data
variable_in_use <- c(ecls_cov,"catholic","c5r2mtsc_std")
omitTF <- apply(ecls[,variable_in_use],1,
function(x)any(is.na(x)))
ecls_nomiss <- ecls[!omitTF,variable_in_use]
mod_match <- matchit(catholic~ race_white + p5hmage + w3income + p5numpla + w3momed_hsb,
data=ecls_nomiss,method = "nearest")
Verify that covariates are balanced across treatment and comparison groups in the matched or weighted sample.¶
The summary shows that, for each confounding variable, the means of the Catholic and public school groups differ by less than one standard deviation.
"Summary of balance for all data" simply shows the means for the Catholic and public schools. We already studied these with Welch's t-tests above. What is interesting is that the mean difference of each confounding variable is substantially reduced once the data are matched. See "Summary of balance for matched data". It seems that the matching worked well.
summary(mod_match)
With type = "QQ", the plot function for matchit objects draws empirical quantile-quantile plots of each covariate, which lets us check the balance of the marginal distributions. Notice that the distributions of the covariates for Catholic (y-axis) and public (x-axis) schools become more comparable after the data are matched.
With type="hist", the plot function outputs histograms of the propensity score in the treated and control groups, before and after matching. After the data are matched, the propensity score distributions of the Catholic and public schools become more comparable.
These observations indicate that, in the matched sample, the potential confounding factors are averaged out.
plot(mod_match,type="QQ")
plot(mod_match,type="hist")
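Besides plots, balance is often summarized with a standardized mean difference (SMD) per covariate. The sketch below uses one common convention, dividing the gap in means by the treated-group standard deviation. The helper smd and the toy data are hypothetical and are not part of MatchIt; a frequently quoted rule of thumb is that an absolute SMD below roughly 0.1 indicates good balance.

```r
# Standardized mean difference: gap in means divided by the treated-group SD
smd <- function(x, treat) {
  (mean(x[treat == 1]) - mean(x[treat == 0])) / sd(x[treat == 1])
}

set.seed(7)
# Hypothetical toy data: the covariate is shifted upward in the treated group
treat <- rep(c(1, 0), times = c(200, 800))
covariate <- c(rnorm(200, mean = 1), rnorm(800, mean = 0))

smd_unbalanced <- smd(covariate, treat)

# A control sample drawn from the same distribution as the treated group
# stands in for a well-matched comparison group
balanced_controls <- rnorm(200, mean = 1)
smd_balanced <- smd(c(covariate[treat == 1], balanced_controls),
                    rep(c(1, 0), each = 200))

cat("SMD, unbalanced:", round(smd_unbalanced, 2), "\n")
cat("SMD, balanced:  ", round(smd_balanced, 2), "\n")
```

The same helper could be applied to a covariate of the matched sample, for example by passing the w3income and catholic columns of the matched data frame.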
Back to Welch's T-test¶
Now perform Welch's t-test again on the matched sample. Do we still get a significant p-value? Yes, we do. However, the sign flips: after matching on the confounding factors above, students at Catholic schools score lower on the standardized test on average. This analysis suggests that going to a Catholic school is not the root cause of the better performance seen in the raw data, and that it could actually hurt performance on the standardized test! Keep in mind that matching only balances the covariates we included, so confounders we did not measure could still bias this estimate.
dta_m <- match.data(mod_match) ## obtain the dataframe of matched samples.
head(dta_m)
with(dta_m, t.test(c5r2mtsc_std ~ catholic))