zepid.causal.doublyrobust.crossfit.DoubleCrossfitTMLE

class zepid.causal.doublyrobust.crossfit.DoubleCrossfitTMLE(df, exposure, outcome, alpha=0.05, continuous_bound=0.0005)

Implementation of the targeted maximum likelihood estimator (TMLE) with a double cross-fit procedure. The purpose of the cross-fit procedure is to allow for non-Donsker nuisance function estimators. Some machine learning algorithms are non-Donsker. In practice, this means that confidence interval coverage can be incorrect, and bias may persist, when such nuisance function estimators are used without cross-fitting. Cross-fitting is meant to alleviate these issues, so cross-fitting with a doubly robust estimator is recommended when using machine learning.

DoubleCrossfitTMLE uses a double cross-fit, where the data set is partitioned into at least three non-overlapping splits. The nuisance functions are estimated within each split, and the estimated nuisance functions are then used to predict values in a different, non-overlapping split. This decouples nuisance function estimation from the data in which the predictions are made.
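The partitioning described above can be illustrated with a minimal sketch. It is not zepid's internal implementation: the helper names `three_way_partition` and `double_crossfit_roles` are hypothetical, and the rotation order of the splits is an assumption made for illustration. The sketch only shows that each nuisance model is fit in one split and that predictions are made in a split neither model has seen.

```python
import random


def three_way_partition(n_obs, seed=None):
    """Randomly partition row indices 0..n_obs-1 into three non-overlapping splits.

    Hypothetical helper for illustration; not part of zepid.
    """
    rng = random.Random(seed)
    indices = list(range(n_obs))
    rng.shuffle(indices)
    # A stride of 3 yields splits whose sizes differ by at most one
    return [sorted(indices[k::3]) for k in range(3)]


def double_crossfit_roles(n_splits=3):
    """List which split fits each nuisance model and which split receives predictions.

    Within every rotation the three roles use three different splits
    (the rotation order is an assumption of this sketch).
    """
    roles = []
    for k in range(n_splits):
        roles.append({
            "exposure_model_fit_on": k,
            "outcome_model_fit_on": (k + 1) % n_splits,
            "predict_in": (k + 2) % n_splits,
        })
    return roles


splits = three_way_partition(n_obs=10, seed=0)
for role in double_crossfit_roles():
    print(role)
```

Because every split takes each role exactly once, every observation receives predictions from nuisance models that were fit without it.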

Note

Because repetitions of the procedure are needed to reduce the variance attributable to any particular partition, this code can take a long time to run.

Parameters:
  • df (DataFrame) – Pandas dataframe containing all necessary variables
  • exposure (str) – Label for treatment column in the pandas data frame
  • outcome (str) – Label for outcome column in the pandas data frame
  • alpha (float, optional) – Alpha for confidence interval level. Default is 0.05
  • continuous_bound (float, optional) – Controls the bounding feature for continuous outcomes. The bounding process may result in values of exactly 0 or 1, which are undefined for logit(x). This parameter is added to values of 0 and subtracted from values of 1. Default is 0.0005
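To see why continuous_bound is needed, consider the following minimal sketch. It assumes the bounding step is a min-max rescaling of the outcome to the unit interval followed by clipping away from 0 and 1; that is an assumption for illustration and may differ from zepid's internal code. The helper names `bound_continuous_outcome` and `logit` are hypothetical.

```python
import math


def bound_continuous_outcome(y, continuous_bound=0.0005):
    """Rescale outcomes to [0, 1], then keep them strictly inside (0, 1).

    Hypothetical helper; assumes min-max scaling followed by clipping.
    Requires at least two distinct outcome values (otherwise the range is zero).
    """
    lower, upper = min(y), max(y)
    scaled = [(value - lower) / (upper - lower) for value in y]
    # The minimum maps to exactly 0 and the maximum to exactly 1;
    # clipping moves them to continuous_bound and 1 - continuous_bound
    return [min(max(s, continuous_bound), 1 - continuous_bound) for s in scaled]


def logit(p):
    """Log-odds of p; undefined at p = 0 and p = 1."""
    return math.log(p / (1 - p))


outcomes = [2.0, 3.5, 5.0, 8.0]
bounded = bound_continuous_outcome(outcomes)
logits = [logit(p) for p in bounded]
print(bounded)
print(logits)
```

Without the clipping step, the smallest and largest outcomes would map to 0 and 1, and `logit` would fail on both.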

Examples

Setting up environment

>>> from sklearn.linear_model import LogisticRegression
>>> from zepid import load_sample_data
>>> from zepid.causal.doublyrobust import DoubleCrossfitTMLE
>>> df = load_sample_data(False).drop(columns='cd4_wk45').dropna()

Estimating the double cross-fit TMLE

>>> dctmle = DoubleCrossfitTMLE(df, exposure='art', outcome='dead')
>>> dctmle.exposure_model("male + age0 + cd40 + dvl0", estimator=LogisticRegression(solver='lbfgs'))
>>> dctmle.outcome_model("art + male + age0 + cd40 + dvl0", estimator=LogisticRegression(solver='lbfgs'))
>>> dctmle.fit(n_splits=5, n_partitions=100)
>>> dctmle.summary()

References

Zivich PN, & Breskin A. (2020). Machine learning for causal inference: on the use of cross-fit estimators. arXiv preprint arXiv:2004.10337.

Newey WK, Robins JR. (2018) “Cross-fitting and fast remainder rates for semiparametric estimation”. arXiv:1801.09138

Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, & Robins J. (2018). “Double/debiased machine learning for treatment and structural parameters”. The Econometrics Journal 21:1; pC1–C6

__init__(df, exposure, outcome, alpha=0.05, continuous_bound=0.0005)

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__(df, exposure, outcome[, alpha, …]) Initialize self.
exposure_model(covariates, estimator[, bound]) Specify the treatment nuisance model variables and estimator(s) to use.
fit([n_splits, n_partitions, method, …]) Runs the crossfit estimation procedure with the targeted maximum likelihood estimator.
outcome_model(covariates, estimator) Specify the outcome nuisance model variables and estimator(s) to use.
run_diagnostics([color]) Runs available diagnostics for the plots.
summary([decimal]) Prints summary of model results.