zepid.causal.doublyrobust.TMLE.TMLE

class zepid.causal.doublyrobust.TMLE.TMLE(df, exposure, outcome, alpha=0.05, continuous_bound=0.0005)

Implementation of the targeted maximum likelihood estimator (TMLE). This implementation calculates TMLE for a time-fixed exposure and a single time-point outcome. By default, standard parametric regression models are used to calculate the estimate of interest. Users can instead supply machine learning algorithms from sklearn and PyGAM.

Note

Valid confidence intervals are only attainable with certain machine learning algorithms: the algorithms must be Donsker class. GAM and LASSO are examples of algorithms that are Donsker class.

Note

TMLE is a doubly-robust substitution estimator. TMLE obtains the target estimate in a single step. The single-step TMLE is described further by van der Laan. For further details, see the listed references.

Continuous outcomes must be bounded between 0 and 1. TMLE does this automatically for the user, and the average treatment effect estimate is converted back to the original scale of Y. When Y is scaled to Y*, some values may equal 0 or 1, which breaks the logit(Y*) transformation. To avoid this issue, Y* is bounded by the continuous_bound argument. The default is 0.0005, the same as R’s tmle.
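The bounding step can be illustrated with a minimal sketch. It assumes min-max scaling on the observed range of Y, which the text above does not state explicitly, and the helper names `bound_continuous_outcome` and `unbound_difference` are hypothetical, not part of zepid.

```python
def bound_continuous_outcome(y, continuous_bound=0.0005):
    """Scale y to the unit interval, then pull values at 0 or 1 inward."""
    y_min, y_max = min(y), max(y)
    # Assumed min-max scaling: Y* = (Y - min) / (max - min)
    y_star = [(v - y_min) / (y_max - y_min) for v in y]
    # Keep Y* strictly inside (0, 1) so that logit(Y*) stays finite
    lower, upper = continuous_bound, 1 - continuous_bound
    return [min(max(v, lower), upper) for v in y_star]


def unbound_difference(psi_star, y_min, y_max):
    """Convert a difference estimated on the Y* scale back to the scale of Y."""
    return psi_star * (y_max - y_min)


y = [10.0, 12.5, 20.0, 15.0]
y_star = bound_continuous_outcome(y)
# A difference of 0.1 on the Y* scale, expressed on the original scale of Y
difference = unbound_difference(0.1, min(y), max(y))
```

Only a difference needs the multiplicative back-conversion shown here; a mean on the Y* scale would also need the minimum added back.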

The following is a general outline of the estimation process for TMLE:

1. Initial estimates for Y are predicted from a regression model. Expected values for each individual are generated under the scenarios of all treated vs. all untreated

\[E(Y|A, L)\]
2. Predicted probabilities of exposure are generated from a regression model
\[\pi_1 = \Pr(A=1|L)\]
3. The ‘clever covariate’ is calculated by
\[H_a(A=a,L) = \frac{I(A=1)}{\pi_1} - \frac{I(A=0)}{\pi_0}\]

for each individual, where \(\pi_0 = 1 - \pi_1\). Afterwards, the predicted Y is set as an offset in the following logit model, which, once fitted, is used to predict values under each treatment strategy

\[\text{logit}(E(Y|A,L)) = \text{logit}(Y_a) + \sigma H_a\]
4. The targeted Psi is estimated, representing the causal effect of all treated vs. all untreated

Confidence intervals are constructed using influence curves.
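The outline above can be sketched in plain Python for the risk difference. This is an illustrative sketch under stated assumptions, not zepid's internal code: the function `tmle_risk_difference` and its argument names are hypothetical, the initial predictions and exposure probabilities are taken as given inputs, the offset model is fitted with a simple Newton-Raphson loop, and the influence curve uses the standard form for the risk difference.

```python
import math
from statistics import NormalDist, mean, variance


def logit(p):
    return math.log(p / (1 - p))


def expit(x):
    return 1 / (1 + math.exp(-x))


def tmle_risk_difference(a, y, pi1, q_a, q1, q0, alpha=0.05):
    """Targeting step and influence-curve interval (hypothetical helper)."""
    n = len(a)
    # Clever covariate H = I(A=1)/pi_1 - I(A=0)/pi_0, with pi_0 = 1 - pi_1
    h = [ai / p - (1 - ai) / (1 - p) for ai, p in zip(a, pi1)]
    offset = [logit(q) for q in q_a]

    # Fit sigma in logit(E[Y|A,L]) = logit(Y_a) + sigma * H (no intercept)
    sigma = 0.0
    for _ in range(50):
        pr = [expit(o + sigma * hi) for o, hi in zip(offset, h)]
        score = sum(hi * (yi - p) for hi, yi, p in zip(h, y, pr))
        info = sum(hi ** 2 * p * (1 - p) for hi, p in zip(h, pr))
        step = score / info
        sigma += step
        if abs(step) < 1e-10:
            break

    # Updated predictions under observed A, all treated, and all untreated
    q_a_star = [expit(o + sigma * hi) for o, hi in zip(offset, h)]
    q1_star = [expit(logit(q) + sigma / p) for q, p in zip(q1, pi1)]
    q0_star = [expit(logit(q) - sigma / (1 - p)) for q, p in zip(q0, pi1)]
    psi = mean(d1 - d0 for d1, d0 in zip(q1_star, q0_star))

    # Influence curve values, then a Wald-type interval at level 1 - alpha
    ic = [hi * (yi - qa) + (d1 - d0) - psi
          for hi, yi, qa, d1, d0 in zip(h, y, q_a_star, q1_star, q0_star)]
    se = math.sqrt(variance(ic) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return psi, (psi - z * se, psi + z * se)


# Toy inputs (made up for illustration only)
a = [1, 1, 1, 0, 0, 0, 1, 0]
y = [1, 0, 1, 0, 0, 1, 1, 0]
pi1 = [0.6, 0.5, 0.7, 0.4, 0.3, 0.5, 0.6, 0.4]
q1 = [0.7, 0.5, 0.8, 0.4, 0.3, 0.6, 0.7, 0.4]
q0 = [0.5, 0.3, 0.6, 0.2, 0.2, 0.4, 0.5, 0.3]
q_a = [t if ai == 1 else u for ai, t, u in zip(a, q1, q0)]
psi, (lower, upper) = tmle_risk_difference(a, y, pi1, q_a, q1, q0)
```

In practice the inputs `pi1`, `q1` and `q0` would come from the fitted exposure and outcome models rather than being typed in by hand.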

Parameters:
  • df (DataFrame) – Pandas dataframe containing the variables of interest
  • exposure (str) – Column label for the exposure of interest
  • outcome (str) – Column label for the outcome of interest
  • alpha (float, optional) – Alpha for confidence interval level. Default is 0.05
  • continuous_bound (float, optional) – Optional argument to control the bounding feature for continuous outcomes. The bounding process may result in values of 0 or 1, for which logit(x) is undefined. This parameter is added to values of 0 and subtracted from values of 1. Default value is 0.0005

Examples

Setting up environment

>>> from zepid import load_sample_data, spline
>>> from zepid.causal.doublyrobust import TMLE
>>> df = load_sample_data(False).dropna()
>>> df[['cd4_rs1', 'cd4_rs2']] = spline(df, 'cd40', n_knots=3, term=2, restricted=True)

Estimating TMLE using logistic regression

>>> tmle = TMLE(df, exposure='art', outcome='dead')
>>> # Specifying exposure/treatment model
>>> tmle.exposure_model('male + age0 + cd40 + cd4_rs1 + cd4_rs2 + dvl0')
>>> # Specifying outcome model
>>> tmle.outcome_model('art + male + age0 + cd40 + cd4_rs1 + cd4_rs2 + dvl0')
>>> # TMLE estimation procedure
>>> tmle.fit()
>>> # Printing main results
>>> tmle.summary()
>>> # Extracting risk difference and confidence intervals, respectively
>>> tmle.risk_difference
>>> tmle.risk_difference_ci

Estimating TMLE with a machine learning algorithm from sklearn

>>> from sklearn.linear_model import LogisticRegression
>>> # The L1 penalty requires a solver that supports it, such as 'liblinear'
>>> log1 = LogisticRegression(penalty='l1', solver='liblinear', random_state=201)
>>> tmle = TMLE(df, 'art', 'dead')
>>> # custom_model allows specification of machine learning algorithms
>>> tmle.exposure_model('male + age0 + cd40 + cd4_rs1 + cd4_rs2 + dvl0', custom_model=log1)
>>> tmle.outcome_model('art + male + age0 + cd40 + cd4_rs1 + cd4_rs2 + dvl0', custom_model=log1)
>>> tmle.fit()

Demonstration of estimating g-model with symmetric bounds

>>> tmle.exposure_model('male + age0 + cd40 + cd4_rs1 + cd4_rs2 + dvl0', bound=0.05)

Demonstration of estimating g-model with asymmetric bounds

>>> tmle.exposure_model('male + age0 + cd40 + cd4_rs1 + cd4_rs2 + dvl0', bound=[0.05, 0.9])
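What the two forms of `bound` plausibly do to the predicted probabilities can be sketched as follows. This assumes that a single value b truncates to [b, 1 - b] and that a two-element list gives the lower and upper limits directly; the helper `truncate_probabilities` is hypothetical and is not zepid's implementation.

```python
def truncate_probabilities(pi1, bound):
    """Truncate predicted probabilities to [lower, upper] (hypothetical helper)."""
    if isinstance(bound, (int, float)):
        # Symmetric (assumed): a single value b gives the interval [b, 1 - b]
        lower, upper = bound, 1 - bound
    else:
        # Asymmetric (assumed): explicit [lower, upper]
        lower, upper = bound
    if not 0 < lower < upper < 1:
        raise ValueError("bounds must satisfy 0 < lower < upper < 1")
    return [min(max(p, lower), upper) for p in pi1]


pi1 = [0.01, 0.30, 0.97]
symmetric = truncate_probabilities(pi1, 0.05)
asymmetric = truncate_probabilities(pi1, [0.05, 0.9])
```

Truncation keeps the clever covariate from becoming extreme when a predicted probability is close to 0 or 1, at the cost of some bias.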

References

Schuler MS, Rose S. “Targeted maximum likelihood estimation for causal inference in observational studies.” American Journal of Epidemiology 185.1 (2017): 65-73.

van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. Springer Science & Business Media, 2011.

van der Laan MJ, Rubin D. “Targeted maximum likelihood learning.” The International Journal of Biostatistics 2.1 (2006).

Gruber S, van der Laan MJ. tmle: An R package for targeted maximum likelihood estimation. (2011).

__init__(df, exposure, outcome, alpha=0.05, continuous_bound=0.0005)

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__(df, exposure, outcome[, alpha, …]) Initialize self.
exposure_model(model[, custom_model, bound, …]) Estimation of Pr(A=1|L), which is termed as g(A=1|L) in the literature
fit() Calculate the effect measures from the predicted exposure probabilities and predicted outcome values using the TMLE procedure.
missing_model(model[, custom_model, bound, …]) Estimation of Pr(M=1|A,L), which is the missing data mechanism for the outcome.
outcome_model(model[, custom_model, bound, …]) Estimation of E(Y|A,L,M=1), which is also written sometimes as Q(A,W,M=1) or Pr(Y=1|A,W,M=1).
plot_kde(to_plot[, bw_method, fill, color, …]) Generates density plots that can be used to check predictions qualitatively.
plot_love([color_unweighted, …]) Generates a Love-plot to detail covariate balance based on the IPTW weights.
positivity([decimal]) Use this to assess whether positivity is a valid assumption for the exposure model / calculated IPTW.
run_diagnostics([decimal]) Run all currently implemented diagnostics for the exposure and outcome models.
standardized_mean_differences() Calculates the standardized mean differences for all variables based on the inverse probability weights.
summary([decimal]) Prints summary of the estimated average causal effects