
Using machine learning for causal effect estimation in observational studies

A simulation comparing OLS and TMLE estimates

In an observational study, we need to assume we know the correct functional form to estimate the causal effect correctly, in addition to assuming that treatment is exogenous.

library(MASS)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tmle)
## Loading required package: SuperLearner
## Loading required package: nnls
## Super Learner
## Version: 2.0-22
## Package created on 2016-12-10
## Welcome to the tmle package, version 1.2.0-2
## 
## Use tmleNews() to see details on changes and bug fixes
## 
## Attaching package: 'tmle'
## The following object is masked from 'package:SuperLearner':
## 
##     SL.glm.interaction
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
set.seed(366)

nobs <- 2000
# pairwise association parameters for (x, z, w)
xw <- .8
xz <- .5
zw <- .6
nrow <- 3
ncol <- 3
# covariance matrix: unit variances, squared parameters off the diagonal
covarMat <- matrix(c(1^2, xz^2, xw^2,
                     xz^2, 1^2, zw^2,
                     xw^2, zw^2, 1^2),
                   nrow = nrow, ncol = ncol)

mu <- rep(0, 3)
# draw the correlated covariates x, z, w
rawvars <- mvrnorm(n = nobs, mu = mu, Sigma = covarMat)
df <- tbl_df(rawvars)
names(df) <- c('x','z','w')
df <- df %>%
    mutate(log.x = log(x^2), log.z = log(z^2), log.w = log(w^2), z.sqr = z^2, w.sqr = w^2) %>%
    # treatment assignment depends on log.w plus noise
    mutate(g.var = log.w + rnorm(nobs)) %>%
    mutate(A = rbinom(nobs, 1, 1/(1 + exp(g.var)))) %>%
    # potential outcomes: y0 depends on log.x; the treatment effect is 2 on average
    mutate(y0 = rnorm(nobs) + log.x) %>%
    mutate(tau.true = 2 + rnorm(nobs), y1 = y0 + tau.true, treat = A, y = treat*y1 + (1-treat)*y0)
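# Sanity check (a small addition): the realized average treatment effect
# in the simulated sample should be close to 2
mean(df$tau.true)
mean(df$y1 - df$y0)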
lm1 <- lm(y ~ A + log.w + log.x , data=df)
summary(lm1)
## 
## Call:
## lm(formula = y ~ A + log.w + log.x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5506 -0.8372 -0.0154  0.8502  4.1624 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01267    0.04778   0.265    0.791    
## A            1.93171    0.06580  29.355   <2e-16 ***
## log.w        0.01105    0.01428   0.774    0.439    
## log.x        1.00162    0.01353  74.030   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.255 on 1996 degrees of freedom
## Multiple R-squared:  0.758,  Adjusted R-squared:  0.7576 
## F-statistic:  2084 on 3 and 1996 DF,  p-value: < 2.2e-16
lm2 <- lm(y ~ A , data=df)
summary(lm2)
## 
## Call:
## lm(formula = y ~ A, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9102  -1.3822   0.3241   1.6933   6.4046 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.92139    0.08992  -10.25   <2e-16 ***
## A            1.38964    0.11366   12.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 1998 degrees of freedom
## Multiple R-squared:  0.06961,    Adjusted R-squared:  0.06915 
## F-statistic: 149.5 on 1 and 1998 DF,  p-value: < 2.2e-16
lm3 <- lm(y ~ A + w, data=df)
summary(lm3)
## 
## Call:
## lm(formula = y ~ A + w, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9112  -1.3795   0.3139   1.6863   6.3468 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.92056    0.08995 -10.234   <2e-16 ***
## A            1.38860    0.11368  12.215   <2e-16 ***
## w           -0.03351    0.05286  -0.634    0.526    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.46 on 1997 degrees of freedom
## Multiple R-squared:  0.0698, Adjusted R-squared:  0.06887 
## F-statistic: 74.93 on 2 and 1997 DF,  p-value: < 2.2e-16
lm4 <- lm(y ~ A + w + x, data=df)
summary(lm4)
## 
## Call:
## lm(formula = y ~ A + w + x, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9109  -1.4047   0.3177   1.6926   6.4528 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.91759    0.08995 -10.201   <2e-16 ***
## A            1.38335    0.11372  12.164   <2e-16 ***
## w           -0.09260    0.06827  -1.356    0.175    
## x            0.09947    0.07275   1.367    0.172    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 1996 degrees of freedom
## Multiple R-squared:  0.07067,    Adjusted R-squared:  0.06927 
## F-statistic:  50.6 on 3 and 1996 DF,  p-value: < 2.2e-16

In this example, the treatment assignment process is determined by log.w, and the outcome is determined by log.x and the treatment. What we observe, however, are the raw w and x. This happens all the time in observational studies. In fact, it is an ideal situation: we observe the variables that determine the outcome, even though we are unsure about the functional form through which they do so. Yet the example shows that unless we observe exactly the factors that enter the DGP (here, log.x and log.w), we get biased estimates of the true treatment effect.

Model 1 is the only one with a reasonable estimate of the treatment effect (which is 2 in this simulation). Model 2 suffers from endogeneity: A is correlated with the omitted variable log.x. Models 3 and 4 include x and w, but not their logged transformations, so their estimates are still biased. The comparison below pulls the coefficient on A from each fitted model.
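As a quick side-by-side check:

# coefficient on A from each model; only lm1 comes close to the true effect of 2
sapply(list(lm1 = lm1, lm2 = lm2, lm3 = lm3, lm4 = lm4),
       function(m) coef(m)['A'])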

The lesson is that functional form matters. In practice, however, we usually have no way of knowing the true functional form. What can we do?

# candidate learners for the outcome regression (Q) and the treatment
# assignment / propensity model (g)
Q.SL.library <- c("SL.randomForest", "SL.glmnet", "SL.loess", "SL.glm", "SL.glm.interaction", "SL.rpart", "SL.bayesglm", "SL.gam", "SL.gbm", "SL.step", "SL.mean")
g.SL.library <- c("SL.randomForest", "SL.glmnet", "SL.glm", "SL.glm.interaction", "SL.rpart", "SL.bayesglm", "SL.gam", "SL.gbm", "SL.step", "SL.mean")

# TMLE with x and w as covariates (note: the raw variables, not the logged
# versions that enter the true DGP)
tmle1 <- tmle(Y = df$y, A = df$treat, W = df[,c('x','w')], g.SL.library = g.SL.library, Q.SL.library = Q.SL.library)
## Loading required package: arm
## Loading required package: lme4
## 
## arm (Version 1.8-6, built: 2015-7-7)
## Working directory is /home/xao/projects/blog/content/post
## Loading required package: gam
## Loading required package: splines
## Loaded gam 1.12
## Loading required package: gbm
## Loading required package: survival
## Loading required package: lattice
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
## Loading required package: rpart
tmle1
##  Additive Effect
##    Parameter Estimate:  1.9989
##    Estimated Variance:  0.0026151
##               p-value:  <2e-16
##     95% Conf Interval: (1.8987, 2.0992)
# the same fit, with z added as an additional covariate
tmle2 <- tmle(Y = df$y, A = df$treat, W = df[,c('x','w','z')], g.SL.library = g.SL.library, Q.SL.library = Q.SL.library)
tmle2
##  Additive Effect
##    Parameter Estimate:  2.013
##    Estimated Variance:  0.0027391
##               p-value:  <2e-16
##     95% Conf Interval: (1.9104, 2.1156)

We use [Mark van der Laan’s TMLE method](http://biostats.bepress.com/ucbbiostat/paper275/). It uses [SuperLearner](http://biostats.bepress.com/ucbbiostat/paper222/) as the initial estimator, an ensemble of multiple machine learning algorithms, so it does not need to assume the functional form of the DGP. Even if we do not observe the variables that determine the outcome directly, as long as we observe some functions (even nonlinear functions) of those variables, we can still get reasonable estimates of the treatment effect.
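To get a feel for what SuperLearner does, here is a minimal sketch (my own illustration, reusing the Q.SL.library defined above; sl.fit is just an illustrative name): fit the ensemble directly to the outcome regression and inspect the weight each candidate learner receives.

# fit the SuperLearner ensemble on the outcome regression;
# the fitted weights show how much each candidate algorithm contributes
sl.fit <- SuperLearner(Y = df$y,
                       X = as.data.frame(df[, c('treat', 'x', 'w')]),
                       SL.library = Q.SL.library,
                       family = gaussian())
sl.fit$coef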

In this example, we used several popular machine learning algorithms to model both the treatment assignment process and the outcome process. The first TMLE model uses x and w as covariates (note: not the logged x and w that enter the true DGP); the second adds z.
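For a direct comparison with the OLS results, the point estimates can be pulled out of the fitted objects (a quick sketch; the tmle package stores them under $estimates$ATE):

# ATE point estimates from the two TMLE fits, next to the OLS benchmark
c(tmle1 = tmle1$estimates$ATE$psi, tmle2 = tmle2$estimates$ATE$psi)
coef(lm1)['A']  # OLS with the correct functional form, for reference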

The TMLE results are less biased than the linear models that use the raw x and w. TMLE may not beat the linear model with the correctly logged x and w, but in empirical studies we usually cannot assume we observe the exact variables in the DGP, only proxies of them. I will run more simulations to check whether TMLE indeed performs better whenever we are unsure about the functional form; we should expect that to be the case.

So far, TMLE can only be used when the treatment is a binary variable.

It’s about time we embraced machine learning techniques in studies of causal effects with observational data.