zepid.causal.gformula.TimeVary.MonteCarloGFormula

class zepid.causal.gformula.TimeVary.MonteCarloGFormula(df, idvar, exposure, outcome, time_in, time_out, weights=None)

Time-varying implementation of the Monte Carlo g-formula. The Monte Carlo estimator is useful for survival data. For an extensive walkthrough of the Monte Carlo g-formula, see Keil et al. 2014 and other listed references. This implementation has four options for the treatment courses:

Options for treatments:
  • all : all individuals are given treatment
  • none : no individuals are given treatment
  • natural : individuals retain their observed treatment
  • custom : create a custom treatment. When specifying this, the dataframe must be referred to as 'g'. The following example selects females aged 25 or older:
    treatment="((g['age0']>=25) & (g['male']==0))"

Note

Custom treatments use a “magic-g” parameter. Internally, the g-formula implementation names the data set g. Therefore, a custom treatment specification must refer to the data set as g and follow the pandas selection syntax.
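To make the “magic-g” convention concrete, the sketch below evaluates a custom treatment string against a small toy dataframe named g. It only illustrates how such a string resolves to a pandas boolean mask; the toy data are made up, and the use of eval here is an assumption for illustration, not a description of the library's internals.

```python
import pandas as pd

# Toy data set; it must be bound to the name `g` for the string to resolve
g = pd.DataFrame({"age0": [22, 25, 31, 40],
                  "male": [0, 0, 1, 0]})

# Custom treatment written in pandas selection syntax, referring to `g`
treatment = "((g['age0']>=25) & (g['male']==0))"

# Evaluating the string yields one boolean per row (illustrative only)
mask = eval(treatment)
treated = mask.astype(int).tolist()
```

Rows where the expression is true would be assigned treatment; all other rows would not.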

Currently, only binary exposures and binary outcomes are supported. Logistic regression models, fit via statsmodels, are used to predict exposures and outcomes. See http://zepid.readthedocs.io/en/latest/ for an example (highly recommended).

Parameters:
  • df (DataFrame) – Pandas dataframe containing the variables of interest
  • idvar (str) – ID column label
  • exposure (str) – Treatment column label
  • outcome (str) – Outcome column label
  • time_out (str) – End of follow-up period time column label
  • time_in (str) – Start of follow-up period time label
  • weights (str, optional) – Column label for weights. Default is None, which assumes every observation has the same weight (i.e. 1)

Notes

  1. The Monte Carlo simulation advances in time steps of one unit. The input dataset should be structured accordingly
  2. Only binary exposures and binary outcomes are supported
  3. Binary and continuous covariates are supported
  4. The labeling of the covariate models is important. They are fit in the order that they are labeled!
  5. Fit the natural course model first and compare to the observed data. They should be similar
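As a rough pre-flight check for note 1, the sketch below flags individuals whose consecutive start times do not advance by exactly one unit. The helper name ids_with_irregular_steps and the toy data are hypothetical, and reading note 1 as "start times increase by one within each individual" is an assumption.

```python
import pandas as pd

def ids_with_irregular_steps(data, idvar="id", time_in="enter"):
    """Return the IDs whose consecutive start times do not differ by 1."""
    ordered = data.sort_values([idvar, time_in])
    # Difference between each start time and the previous one, within ID
    step = ordered.groupby(idvar)[time_in].diff()
    # The first row of each ID has no previous row (NaN), so it is skipped
    bad = ordered.loc[step.notna() & (step != 1), idvar]
    return sorted(bad.unique().tolist())

# Toy long-format data: id 2 jumps from time 0 to time 2
toy = pd.DataFrame({"id": [1, 1, 1, 2, 2],
                    "enter": [0, 1, 2, 0, 2]})
irregular = ids_with_irregular_steps(toy)
```

An empty list would suggest the data already follow unit time steps under this reading.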

Process for the Monte Carlo g-formula

  1. run lines in “in_recode”
  2. for each time-varying covariate, in ascending order of “label”:
    1. predict time-varying covariate
    2. run lines in “recode” from “add_covariate_model()”
  3. predict exposure / apply exposure pattern
  4. predict outcome
  5. run lines in “out_recode”
  6. lag variables in “lags”
  7. append current-time rows to full dataframe
  8. repeat until t_max is reached
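The loop above can be sketched on a toy problem. The example below is not the library's implementation: it replaces the fitted regression models with hand-picked logistic coefficients, uses a single binary covariate named cov, and the function name simulate is hypothetical. It is only meant to show the order of operations within each time step.

```python
import numpy as np
import pandas as pd

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate(n=1000, t_max=5, treat_all=True, seed=0):
    """Toy Monte Carlo loop with made-up coefficients (illustration only)."""
    rng = np.random.default_rng(seed)
    g = pd.DataFrame({"id": np.arange(n), "enter": 0,
                      "lag_art": 0, "lag_cov": 0})
    rows = []
    for t in range(t_max):
        g["enter"] = t                                 # 1. in_recode-style update
        p_cov = expit(-1.0 + 0.5 * g["lag_cov"] - 0.8 * g["lag_art"])
        g["cov"] = rng.binomial(1, p_cov)              # 2. predict covariate
        g["art"] = 1 if treat_all else 0               # 3. apply exposure pattern
        p_out = expit(-3.0 + 0.7 * g["cov"] - 1.0 * g["art"])
        g["dead"] = rng.binomial(1, p_out)             # 4. predict outcome
        g["out"] = t + 1                               # 5. out_recode-style update
        g["lag_art"] = g["art"]                        # 6. lag variables
        g["lag_cov"] = g["cov"]
        keep = ["id", "enter", "out", "art", "cov", "dead"]
        rows.append(g[keep].copy())                    # 7. append current-time rows
        g = g.loc[g["dead"] == 0].copy()               # only survivors continue
    return pd.concat(rows, ignore_index=True)          # 8. stop once t_max is reached

sim_all = simulate(n=2000, t_max=5, treat_all=True, seed=1)
sim_none = simulate(n=2000, t_max=5, treat_all=False, seed=1)
risk_all = sim_all.groupby("id")["dead"].max().mean()
risk_none = sim_none.groupby("id")["dead"].max().mean()
```

Comparing risk_all with risk_none gives the contrast between the two treatment plans in this toy setting; in practice the probabilities would come from the fitted exposure, covariate, and outcome models.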

Examples

Setting up the environment

>>> import numpy as np
>>> from zepid import load_sample_data, spline
>>> from zepid.causal.gformula import MonteCarloGFormula
>>> df = load_sample_data(timevary=True)
>>> df['lag_art'] = df['art'].shift(1)
>>> df['lag_art'] = np.where(df.groupby('id').cumcount() == 0, 0, df['lag_art'])
>>> df['lag_cd4'] = df['cd4'].shift(1)
>>> df['lag_cd4'] = np.where(df.groupby('id').cumcount() == 0, df['cd40'], df['lag_cd4'])
>>> df['lag_dvl'] = df['dvl'].shift(1)
>>> df['lag_dvl'] = np.where(df.groupby('id').cumcount() == 0, df['dvl0'], df['lag_dvl'])
>>> df[['age_rs0', 'age_rs1', 'age_rs2']] = spline(df, 'age0', n_knots=4, term=2, restricted=True)
>>> df['cd40_sq'] = df['cd40'] ** 2
>>> df['cd40_cu'] = df['cd40'] ** 3
>>> df['cd4_sq'] = df['cd4'] ** 2
>>> df['cd4_cu'] = df['cd4'] ** 3
>>> df['enter_sq'] = df['enter'] ** 2
>>> df['enter_cu'] = df['enter'] ** 3
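The lagged variables in the setup above are built by shifting the whole column and then overwriting each individual's first row. A minimal sketch of an alternative, shifting within each id, is shown below on made-up toy data; the column names mirror the example but the values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2],
    "art":  [0, 1, 1, 0, 0],
    "cd4":  [300.0, 350.0, 400.0, 500.0, 480.0],
    "cd40": [310.0, 310.0, 310.0, 510.0, 510.0],  # baseline CD4, constant within id
})

# Shift within each id so values never leak across individuals;
# the first row of each id has no previous value and is filled explicitly
toy["lag_art"] = toy.groupby("id")["art"].shift(1).fillna(0).astype(int)
toy["lag_cd4"] = toy.groupby("id")["cd4"].shift(1).fillna(toy["cd40"])
```

The first row of each individual takes 0 for lag_art and the baseline value for lag_cd4, matching the convention used in the setup.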

Estimating the g-formula with the Monte Carlo estimator

>>> g = MonteCarloGFormula(df, idvar='id', exposure='art', outcome='dead', time_in='enter', time_out='out')
>>> # Specifying the exposure/treatment model
>>> exp_m = ('male + age0 + age_rs0 + age_rs1 + age_rs2 + cd40 + cd40_sq + cd40_cu + dvl0 + cd4 + cd4_sq + '
>>>          'cd4_cu + dvl + enter + enter_sq + enter_cu')
>>> g.exposure_model(exp_m, restriction="g['lag_art']==0")  # restriction enforces intent-to-treat
>>> # Specifying the outcome model
>>> out_m = ('art + male + age0 + age_rs0 + age_rs1 + age_rs2 + cd40 + cd40_sq + cd40_cu + dvl0 + cd4 + '
>>>          'cd4_sq + cd4_cu + dvl + enter + enter_sq + enter_cu')
>>> g.outcome_model(out_m, restriction="g['drop']==0")  # restriction excludes those lost to follow-up
>>> # Specifying the time-varying confounder models
>>> dvl_m = ('male + age0 + age_rs0 + age_rs1 + age_rs2 + cd40 + cd40_sq + cd40_cu + dvl0 + lag_cd4 + '
>>>          'lag_dvl + lag_art + enter + enter_sq + enter_cu')
>>> g.add_covariate_model(label=1, covariate='dvl', model=dvl_m, var_type='binary')
>>> cd4_m = ('male + age0 + age_rs0 + age_rs1 + age_rs2 + cd40 + cd40_sq + cd40_cu + dvl0 + lag_cd4 + '
>>>          'lag_dvl + lag_art + enter + enter_sq + enter_cu')
>>> cd4_recode_scheme = ("g['cd4'] = np.maximum(g['cd4'],1);"  # Recode keeps cd4 >= 1 and updates its polynomial terms
>>>                      "g['cd4_sq'] = g['cd4']**2;"
>>>                      "g['cd4_cu'] = g['cd4']**3")
>>> g.add_covariate_model(label=2, covariate='cd4', model=cd4_m, recode=cd4_recode_scheme, var_type='continuous')
>>> # Specifying a model for informative censoring
>>> cens_m = ("male + age0 + age_rs0 + age_rs1 + age_rs2 + cd40 + cd40_sq + cd40_cu + dvl0 + lag_cd4 + "
>>>           "lag_dvl + lag_art + enter + enter_sq + enter_cu")
>>> g.censoring_model(cens_m)
>>> # Estimating outcomes by Monte Carlo simulation under the natural course
>>> g.fit(treatment="((g['art']==1) | (g['lag_art']==1))",  # Treatment plan (natural course in this case)
>>>       lags={'art': 'lag_art',  # Creating variables to lag in the process
>>>             'cd4': 'lag_cd4',
>>>             'dvl': 'lag_dvl'},
>>>       sample=50000,  # Number of resamples to use (should be large number to reduce simulation error)
>>>       t_max=None,  # Maximum time to simulate to (None uses data set maximum time)
>>>       in_recode=("g['enter_sq'] = g['enter']**2;"
>>>                  "g['enter_cu'] = g['enter']**3"))  # How to recode time in each time-step
>>> # See website documentation for further instructions
>>> # (https://zepid.readthedocs.io/en/latest/Causal.html#g-computation-algorithm-monte-carlo)
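Note 5 recommends comparing the natural course to the observed data. One simple way to do that, sketched below, is to compute cumulative risk by time from long-format rows and apply the same function to both data sets. The helper cumulative_risk and the toy data are hypothetical, and the calculation assumes no censoring; with censoring, a Kaplan-Meier style estimator would be more appropriate.

```python
import pandas as pd

def cumulative_risk(data, idvar="id", time_out="out", outcome="dead"):
    """Cumulative proportion of individuals with the event by each event time."""
    n = data[idvar].nunique()
    # Number of individuals experiencing the event at each end time
    events = data.loc[data[outcome] == 1].groupby(time_out)[idvar].nunique()
    return (events.sort_index().cumsum() / n).rename("risk")

# Toy long-format data standing in for observed (or simulated) rows
toy_observed = pd.DataFrame({
    "id":   [1, 1, 2, 2, 2, 3, 4, 4],
    "out":  [1, 2, 1, 2, 3, 1, 1, 2],
    "dead": [0, 1, 0, 0, 0, 1, 0, 0],
})
risk = cumulative_risk(toy_observed)
```

Applying the same function to the observed rows and to the simulated natural-course rows gives two curves that should be similar if the models are adequately specified.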

References

Keil, AP, Edwards, JK, Richardson, DB, Naimi, AI, Cole, SR (2014). The Parametric g-Formula for Time-to-Event Data: Intuition and a Worked Example. Epidemiology 25(6), 889-897

__init__(df, idvar, exposure, outcome, time_in, time_out, weights=None)

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__(df, idvar, exposure, outcome, …) Initialize self.
add_covariate_model(label, covariate, model) Add a specified regression model for time-varying confounders.
censoring_model(model[, restriction, …]) Add a specified regression model for censoring.
exposure_model(model[, restriction, …]) Add a specified regression model for the exposure.
fit(treatment[, lags, sample, t_max, …]) Estimate the counterfactual outcomes under the specified treatment plan using the previously specified regression models.
outcome_model(model[, restriction, …]) Add a specified regression model for the outcome.