zepid.causal.doublyrobust.crossfit.DoubleCrossfitAIPTW

class zepid.causal.doublyrobust.crossfit.DoubleCrossfitAIPTW(df, exposure, outcome, alpha=0.05)

Implementation of the augmented inverse probability weighted (AIPTW) estimator with a double cross-fit procedure. The purpose of the cross-fit procedure is to allow for non-Donsker nuisance function estimators, a class that includes some machine learning algorithms. In practice, this means that confidence interval coverage can be incorrect, and bias may persist, when such nuisance function estimators are used without cross-fitting. Cross-fitting is meant to alleviate this issue, so cross-fitting with a doubly robust estimator is recommended when using machine learning.
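As a rough illustration of the estimator being cross-fit, the sketch below computes an AIPTW risk difference from predictions that are assumed to have been produced already by the two nuisance models. The function name `aipw_risk_difference` and its arguments are hypothetical and are not part of zEpid; the variance uses the standard influence-function form.

```python
from math import sqrt
from statistics import mean, variance


def aipw_risk_difference(a, y, pi, m1, m0):
    """Hypothetical helper: AIPTW risk difference and standard error.

    a  : observed binary exposure (0/1)
    y  : observed outcome
    pi : predicted Pr(A=1 | W) from the exposure nuisance model
    m1 : predicted E[Y | A=1, W] from the outcome nuisance model
    m0 : predicted E[Y | A=0, W] from the outcome nuisance model
    """
    # Pseudo-outcome under exposure: prediction plus inverse-probability-weighted residual
    y1 = [m1_i + a_i * (y_i - m1_i) / pi_i
          for a_i, y_i, pi_i, m1_i in zip(a, y, pi, m1)]
    # Pseudo-outcome under no exposure
    y0 = [m0_i + (1 - a_i) * (y_i - m0_i) / (1 - pi_i)
          for a_i, y_i, pi_i, m0_i in zip(a, y, pi, m0)]
    diff = [u - v for u, v in zip(y1, y0)]
    rd = mean(diff)
    # Influence-function-based variance: sample variance of contributions over n
    se = sqrt(variance(diff) / len(diff))
    return rd, se
```

In a cross-fit procedure, `pi`, `m1`, and `m0` for a given observation would come from models fit on data that do not include that observation.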

DoubleCrossfitAIPTW allows for double cross-fitting, where the data set is partitioned into at least three non-overlapping splits. Each nuisance function is estimated in its own split, and the estimated nuisance functions are then used to predict values in a separate split, so that a different split is used for each nuisance function. A double cross-fit procedure further decouples the nuisance function estimation compared to single cross-fit procedures.
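The split assignment described above can be sketched as a simple rotation. This is only an illustration under the assumption that roles rotate across splits; the helper `double_crossfit_roles` is hypothetical, and the way zEpid assigns splits internally may differ.

```python
import random


def double_crossfit_roles(n_obs, n_splits=3, seed=None):
    """Hypothetical helper: assign distinct splits to each nuisance model.

    Returns one dict per split, listing the rows that receive predictions
    and the rows used to fit the exposure and outcome models.
    """
    if n_splits < 3:
        raise ValueError("double cross-fitting needs at least three splits")
    rng = random.Random(seed)
    idx = list(range(n_obs))
    rng.shuffle(idx)
    # Deal shuffled indices round-robin so split sizes differ by at most one
    splits = [idx[k::n_splits] for k in range(n_splits)]
    roles = []
    for k in range(n_splits):
        roles.append({
            "predict_in": splits[k],
            # The two nuisance models are fit on two other, distinct splits
            "fit_exposure_on": splits[(k + 1) % n_splits],
            "fit_outcome_on": splits[(k + 2) % n_splits],
        })
    return roles
```

Because every observation appears in exactly one `predict_in` list, each row ends up with predictions from models that never saw it, and the two models never share training rows.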

Note

Because the procedure must be repeated over many partitions to reduce the variance attributable to any particular partition, this code can take a long time to run. On a data set of 3000 observations with 100 different partitions, it takes about an hour. The advantage is that the code can be run in parallel. See the documentation for an example.
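One common way to combine results across partitions is the median-based summary described by Chernozhukov et al. (2018). The sketch below shows that calculation only; `summarize_partitions` is a hypothetical helper, and whether a given `fit` option in zEpid uses exactly this formula is an assumption.

```python
from statistics import median


def summarize_partitions(estimates, variances):
    """Hypothetical helper: median-based summary over repeated partitions.

    estimates : point estimate from each partition
    variances : estimated variance from each partition
    """
    point = median(estimates)
    # Add the squared deviation from the overall median to each variance,
    # so that between-partition variability is reflected in the summary
    var = median(v + (est - point) ** 2
                 for est, v in zip(estimates, variances))
    return point, var
```

Since each partition is processed independently, the per-partition estimates can be computed in separate processes and passed to a summary step like this one afterwards.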

Parameters:
  • df (DataFrame) – Pandas dataframe containing all necessary variables
  • exposure (str) – Label for treatment column in the pandas data frame
  • outcome (str) – Label for outcome column in the pandas data frame
  • alpha (float, optional) – Alpha for confidence interval level. Default is 0.05
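To show how alpha maps to an interval, here is a minimal sketch of a two-sided Wald-type confidence interval. It assumes a normal approximation; `wald_interval` is a hypothetical helper and not a zEpid function.

```python
from statistics import NormalDist


def wald_interval(estimate, std_err, alpha=0.05):
    """Hypothetical helper: two-sided (1 - alpha) Wald confidence interval."""
    # Critical value of the standard normal, about 1.96 when alpha = 0.05
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * std_err, estimate + z * std_err
```

With the default alpha=0.05 this yields a 95% interval; a smaller alpha gives a wider interval.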

Examples

Setting up environment

>>> from sklearn.linear_model import LogisticRegression
>>> from zepid import load_sample_data
>>> from zepid.causal.doublyrobust import DoubleCrossfitAIPTW
>>> df = load_sample_data(False).drop(columns='cd4_wk45').dropna()

Estimating the double cross-fit AIPTW

>>> dcaipw = DoubleCrossfitAIPTW(df, exposure='art', outcome='dead')
>>> dcaipw.exposure_model("male + age0 + cd40 + dvl0", estimator=LogisticRegression(solver='lbfgs'))
>>> dcaipw.outcome_model("art + male + age0 + cd40 + dvl0", estimator=LogisticRegression(solver='lbfgs'))
>>> dcaipw.fit(n_splits=5, n_partitions=100)
>>> dcaipw.summary()

References

Newey WK, Robins JR. (2018) “Cross-fitting and fast remainder rates for semiparametric estimation”. arXiv:1801.09138

Zivich PN, & Breskin A. (2020). Machine learning for causal inference: on the use of cross-fit estimators. arXiv preprint arXiv:2004.10337.

Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, & Robins J. (2018). “Double/debiased machine learning for treatment and structural parameters”. The Econometrics Journal 21(1):C1–C68

__init__(df, exposure, outcome, alpha=0.05)

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__(df, exposure, outcome[, alpha]) Initialize self.
exposure_model(covariates, estimator[, bound]) Specify the treatment nuisance model variables and estimator(s) to use.
fit([n_splits, n_partitions, method, …]) Runs the cross-fit estimation procedure with the augmented inverse probability weighted estimator.
outcome_model(covariates, estimator) Specify the outcome nuisance model variables and estimator(s) to use.
run_diagnostics([color]) Runs available diagnostics for the plots.
summary([decimal]) Prints a summary of the model results.