zepid.causal.gformula.TimeFixed.SurvivalGFormula

class zepid.causal.gformula.TimeFixed.SurvivalGFormula(df, idvar, exposure, outcome, time, weights=None)

G-formula for time-to-event data where the exposure is fixed at baseline. Only binary exposures and outcomes are supported. Outcomes are predicted using a logistic regression model. The input data set should be in long format, where each row corresponds to one individual observed for one unit of time.

Key options for treatments:

  • ‘all’ – all individuals are given treatment
  • ‘none’ – no individuals are given treatment
  • custom treatments – create a custom treatment. When specifying this, the dataframe must be referred to as ‘g’. The following example selects females aged 30 or younger: treatment="((g['age0']<=30) & (g['male']==0))"
Parameters:
  • df (DataFrame) – Pandas dataframe containing the variables of interest
  • idvar (str) – Column name for the ID label
  • exposure (str, list) – Column name for exposure variable label or a list of disjoint indicator exposures
  • outcome (str) – Column name for outcome variable
  • time (str) – Column name for time variable
  • weights (str, optional) – Column name for weights. Default is None, which assumes every observation has the same weight (i.e. 1)

Note

Custom treatments use a “magic-g” parameter. Internally, the g-formula implementation names the data set g. Therefore, a custom treatment specification must refer to the data set as g and follow the pandas selection syntax.
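As a rough illustration of the naming convention, the sketch below evaluates a custom treatment string against a toy dataframe bound to the name g. The data values and the art_plan column are made up for this example, and the use of eval here is an assumption for demonstration only. It is not a description of zepid's internal code.

```python
import pandas as pd

# Toy baseline data; column names mirror the example above, values are made up.
g = pd.DataFrame({
    "age0": [22, 30, 41, 28],
    "male": [0, 1, 0, 0],
})

# A custom treatment plan written against the name `g`.
treatment = "((g['age0']<=30) & (g['male']==0))"

# Evaluate the string with `g` in scope to obtain a boolean mask,
# then convert it to a 0/1 indicator of who is treated under the plan.
mask = eval(treatment, {"g": g})
g["art_plan"] = mask.astype(int)  # hypothetical column name

print(g["art_plan"].tolist())  # [1, 0, 0, 1]
```

If the string referred to the data by any other name (for example df), the expression would fail with a NameError in this sketch, which is why the specification has to use g.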

Examples

Setting up data in long format

>>> import numpy as np
>>> import pandas as pd
>>> from zepid import load_sample_data
>>> from zepid.causal.gformula import SurvivalGFormula
>>> import matplotlib.pyplot as plt
>>> df = load_sample_data(False).drop(columns=['cd4_wk45'])
>>> df['t'] = np.round(df['t']).astype(int)
>>> df = pd.DataFrame(np.repeat(df.values, df['t'], axis=0), columns=df.columns)
>>> df['t'] = df.groupby('id')['t'].cumcount() + 1
>>> df.loc[((df['dead'] == 1) & (df['id'] != df['id'].shift(-1))), 'd'] = 1
>>> df['d'] = df['d'].fillna(0)
>>> df['t_sq'] = df['t']**2
>>> df['t_cu'] = df['t']**3

Estimating the time-to-event mean effect under treat-all plan

>>> sgf = SurvivalGFormula(df.drop(columns=['dead']), idvar='id', exposure='art', outcome='d', time='t')
>>> sgf.outcome_model(model='art + male + age0 + cd40 + dvl0 + t + t_sq + t_cu')
>>> sgf.fit(treatment='all')
>>> print(sgf.marginal_outcome)

Plotting cumulative incidence function

>>> sgf.plot(color='r')
>>> plt.show()

Estimating the time-to-event mean effect under treat-none plan

>>> sgf = SurvivalGFormula(df.drop(columns=['dead']), idvar='id', exposure='art', outcome='d', time='t')
>>> sgf.outcome_model(model='art + male + age0 + cd40 + dvl0 + t + t_sq + t_cu')
>>> sgf.fit(treatment='none')

Estimating the time-to-event mean effect under custom treatment plan

>>> sgf = SurvivalGFormula(df.drop(columns=['dead']), idvar='id', exposure='art', outcome='d', time='t')
>>> sgf.outcome_model(model='art + male + age0 + cd40 + dvl0 + t + t_sq + t_cu')
>>> sgf.fit(treatment="((g['age0']>=25) & (g['male']==0))")

Notes

The following process is used to estimate the cumulative incidence function:

  (1) A pooled logistic regression model is fit to the data. The model should predict the outcome conditional on treatment, baseline confounders, and time. Time should be modeled using flexible functional forms (e.g. splines).
  (2) Survival probabilities are estimated by predicting values at each time from the pooled logistic model and taking the cumulative product. The survival probabilities are predicted under the treatment plan of interest.
  (3) The cumulative incidence (one minus the survival probability) is averaged across all subjects at each time period.
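Steps (2) and (3) can be sketched in plain Python as follows. This is a minimal illustration, not zepid's implementation: the function name cumulative_incidence and the hazard values are hypothetical, the predicted hazards are assumed to come from an already-fitted pooled logistic model, and weights are ignored.

```python
from itertools import accumulate
from operator import mul


def cumulative_incidence(hazards_by_subject):
    """Average cumulative incidence over subjects at each time point.

    hazards_by_subject: one list per subject of predicted discrete-time
    hazards, Pr(event at t | event-free before t), all of equal length.
    """
    risks = []
    for hazards in hazards_by_subject:
        # Survival through time t is the running product of (1 - hazard).
        survival = accumulate((1.0 - h for h in hazards), mul)
        # Cumulative incidence is the complement of survival.
        risks.append([1.0 - s for s in survival])
    n = len(risks)
    # Average across subjects, separately for each time point.
    return [sum(column) / n for column in zip(*risks)]


# Hypothetical predicted hazards for two subjects over three time units.
hazards = [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]]
cif = cumulative_incidence(hazards)
print([round(r, 4) for r in cif])  # [0.15, 0.275, 0.3795]
```

In this sketch the averaging happens on the risk scale after the cumulative product has been taken for each subject, which matches the order of steps (2) and (3) above.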

References

Hernán MA. (2010). The hazards of hazard ratios. Epidemiology, 21(1), 13–15. doi:10.1097/EDE.0b013e3181c1ea43

__init__(df, idvar, exposure, outcome, time, weights=None)

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__(df, idvar, exposure, outcome, time[, weights]) Initialize self.
fit(treatment) Fit the parametric g-formula for time-to-event data.
outcome_model(model[, print_results]) Build the pooled logistic model.
plot(**plot_kwargs) Plot the estimated cumulative incidence function.