zepid.causal.ipw.IPCW.IPCW

class zepid.causal.ipw.IPCW.IPCW(df, idvar, time, event, flat_df=False, enter=None)

Calculates inverse probability of censoring weights (IPCW). This class accepts either a flat file (one row per individual) or a long-format file (multiple rows per individual). A flat file must be converted to long format, which is done automatically when flat_df=True. In that case, a warning and some comparison statistics are provided; please verify that they match. In general, it is recommended to convert the data set yourself.
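As a rough illustration of what converting a flat file to long format involves, the sketch below expands one row per individual into one row per unit-length interval and keeps the event indicator only on the final interval. It is written in plain Python and is not zEpid's own conversion routine: the helper name `flat_to_long`, the `t_enter` and `t_out` column names, and the unit-interval splitting are all assumptions made for the example.

```python
import math


def flat_to_long(rows, idvar, time, event):
    """Expand one-row-per-person records into unit-length intervals.

    Hypothetical helper for illustration only; not part of zEpid.
    """
    long_rows = []
    for row in rows:
        end = row[time]
        n_intervals = max(1, math.ceil(end))  # at least one interval per person
        for k in range(n_intervals):
            is_last = (k == n_intervals - 1)
            long_rows.append({
                idvar: row[idvar],
                "t_enter": k,
                "t_out": min(k + 1, end),  # final interval stops at the observed time
                # the event can only occur in the final interval
                event: row[event] if is_last else 0,
            })
    return long_rows


# Made-up flat data: one row per individual
flat = [
    {"id": 1, "t": 2.5, "dead": 1},
    {"id": 2, "t": 1.0, "dead": 0},
]
long_format = flat_to_long(flat, idvar="id", time="t", event="dead")
for r in long_format:
    print(r)
```

A simple check in the spirit of the comparison statistics is that the total number of events and the final observed time per individual are the same before and after the conversion.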

IPCW are calculated via logistic regression, and the weights are cumulative products within each unique ID. IPCW can be used to correct for data that are missing at random given the variables in the fitted model, for example in weighted Kaplan-Meier curves. The formula used to generate the unstabilized IPCW is

\[\pi_i(t) = \prod_{R_k \le t} \frac{1}{\Pr(C_i > R_k | \bar{L} = \bar{l}, C_i > R_{k-1})}\]

The stabilized IPCW substitutes predicted probabilities under the specified numerator model into the numerator of the previous equation. In general, it is recommended to stabilize IPCW by time.

\[\pi_i(t) = \prod_{R_k \le t} \frac{\Pr(C_i > R_k)}{\Pr(C_i > R_k | \bar{L} = \bar{l}, C_i > R_{k-1})}\]
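To make the two formulas concrete, the following sketch computes unstabilized and stabilized weights for a single individual as cumulative products over intervals. The probabilities are made-up numbers standing in for predictions from the denominator and numerator models; nothing here calls zEpid itself.

```python
from itertools import accumulate
from operator import mul

# Hypothetical predicted probabilities of remaining uncensored in each
# interval for one individual (made-up numbers, not model output).
p_denominator = [0.95, 0.90, 0.80]  # conditional on covariates
p_numerator = [0.96, 0.92, 0.85]    # from the numerator model

# Unstabilized: cumulative product of 1 / Pr(uncensored | covariates)
unstabilized = list(accumulate((1 / d for d in p_denominator), mul))

# Stabilized: cumulative product of numerator / denominator
stabilized = list(
    accumulate((n / d for n, d in zip(p_numerator, p_denominator)), mul)
)

for k, (u, s) in enumerate(zip(unstabilized, stabilized), start=1):
    print(f"interval {k}: unstabilized={u:.3f}, stabilized={s:.3f}")
```

Because each numerator probability is at most 1, the stabilized weight never exceeds the unstabilized weight at the same time point, which is why stabilization tends to reduce the variability of the weights.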

Note

IPCW no longer supports late entry, because the pooled logistic regression approach does not correctly accumulate the weights for late entrants. As such, either all late entries need to be dropped (the new-user design) or rows need to be back-propagated (unobserved rows are filled in). The second approach requires filling in the unobserved covariates and, for time-varying variables, will require imputation. The new-user design is the safer option and is what is currently recommended.
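One way to apply the new-user design before constructing the weights is to keep only individuals whose first observed interval starts at the time origin. The sketch below does this on long-format rows in plain Python. The helper `drop_late_entries`, the column names, and the choice of 0 as the origin are assumptions for illustration, and dropping the whole individual (rather than single rows) is one interpretation of the design.

```python
def drop_late_entries(rows, idvar, enter, origin=0):
    """Keep only individuals whose earliest interval starts at `origin`.

    Hypothetical helper for illustration only; not part of zEpid.
    """
    # Earliest entry time observed for each individual
    first_entry = {}
    for row in rows:
        pid = row[idvar]
        first_entry[pid] = min(first_entry.get(pid, row[enter]), row[enter])

    # Individuals observed from the origin are retained in full
    keep = {pid for pid, start in first_entry.items() if start == origin}
    return [row for row in rows if row[idvar] in keep]


# Made-up long-format data
long_rows = [
    {"id": 1, "enter": 0, "out": 1},
    {"id": 1, "enter": 1, "out": 2},
    {"id": 2, "enter": 3, "out": 4},  # late entry: first observed at t=3
    {"id": 3, "enter": 0, "out": 1},
]
new_users = drop_late_entries(long_rows, idvar="id", enter="enter")
print(sorted({row["id"] for row in new_users}))
```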

Parameters:
  • df (DataFrame) – Pandas DataFrame object containing all the variables of interest
  • idvar (str) – String that indicates the column name for a unique identifier for each individual
  • time (str) – Column name for the ending observation time
  • event (str) – Column name for the event of interest
  • flat_df (bool, optional) – Whether the input dataframe only contains a single row per participant. If so, the flat dataframe is converted to a long dataframe. Default is False (for multiple rows per person)
  • enter (str, optional) – Time participant began being observed. Default is None. This option is only needed when flat_df=True. Late-entries are no longer supported and specifying this will lead to a ValueError

Example

Setting up the environment

>>> from zepid import load_sample_data
>>> from zepid.causal.ipw import IPCW
>>> df = load_sample_data(timevary=True)
>>> df['enter_q'] = df['enter'] ** 2
>>> df['enter_c'] = df['enter'] ** 3
>>> df['age0_q'] = df['age0'] ** 2
>>> df['age0_c'] = df['age0'] ** 3

Calculating stabilized IPCW with a long data set

>>> ipc = IPCW(df, idvar='id', time='enter', event='dead')
>>> ipc.regression_models(model_denominator='enter + enter_q + enter_c + male + age0 + age0_q + age0_c',
...                       model_numerator='enter + enter_q + enter_c')
>>> ipc.fit()

Extracting calculated stabilized IPCW

>>> ipc.Weight
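Once weights are available, they can be used in a weighted Kaplan-Meier estimator. The sketch below is a minimal plain-Python version that assumes each long-format row already carries its weight; the function `weighted_kaplan_meier`, the row keys, and the numbers are hypothetical and are not taken from zEpid.

```python
def weighted_kaplan_meier(rows):
    """Weighted Kaplan-Meier estimate from long-format interval rows.

    Each row is a dict with keys 'enter', 'out', 'event', 'weight' and
    covers the interval (enter, out].
    Hypothetical helper for illustration only; not part of zEpid.
    """
    event_times = sorted({r["out"] for r in rows if r["event"] == 1})
    survival, curve = 1.0, []
    for t in event_times:
        # Weighted number at risk at t, and weighted number of events at t
        at_risk = sum(r["weight"] for r in rows if r["enter"] < t <= r["out"])
        events = sum(
            r["weight"] for r in rows if r["out"] == t and r["event"] == 1
        )
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Made-up rows; in practice the weights would come from the fitted IPCW
rows = [
    {"id": 1, "enter": 0, "out": 1, "event": 0, "weight": 1.0},
    {"id": 1, "enter": 1, "out": 2, "event": 1, "weight": 1.5},
    {"id": 2, "enter": 0, "out": 1, "event": 1, "weight": 1.0},
    {"id": 3, "enter": 0, "out": 1, "event": 0, "weight": 1.0},
    {"id": 3, "enter": 1, "out": 2, "event": 0, "weight": 1.0},
    {"id": 4, "enter": 0, "out": 1, "event": 0, "weight": 1.0},  # censored after t=1
]
curve = weighted_kaplan_meier(rows)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

With all weights set to 1 this reduces to the ordinary Kaplan-Meier estimator, which is a convenient sanity check.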

Calculating stabilized IPCW with a wide data set

>>> df = load_sample_data(timevary=False)
>>> ipc = IPCW(df, idvar='id', time='t', event='dead', flat_df=True)
>>> ipc.regression_models(model_denominator='enter + enter_q + enter_c + male + age0 + age0_q + age0_c',
...                       model_numerator='enter + enter_q + enter_c')
>>> ipc.fit()

References

Howe CJ et al. (2016) Selection bias due to loss to follow up in cohort studies. Epidemiology, 27(1), 91-97.

__init__(df, idvar, time, event, flat_df=False, enter=None)

Initialize self. See help(type(self)) for accurate signature.

Methods

  • __init__(df, idvar, time, event[, flat_df, …]) – Initialize self.
  • fit() – Calculates IPCW for each observation period for each observation.
  • regression_models(model_denominator, …[, …]) – Regression model to generate predicted probabilities of censoring, conditional on specified variables.