zepid.causal.ipw.IPCW.IPCW
class zepid.causal.ipw.IPCW.IPCW(df, idvar, time, event, flat_df=False, enter=None)
Calculates inverse probability of censoring weights (IPCW). This class accepts either a flat file (one row per individual) or a long format (multiple rows per individual). A flat file must be converted to a long format before weights can be estimated; this is done automatically if flat_df=True. When the conversion is automatic, a warning and some comparison statistics are provided. Please verify that they match. In general, it is recommended to convert the data set yourself.
IPCW are calculated via logistic regression, and the weights are cumulative products per unique ID. In weighted Kaplan-Meier curves, IPCW can be used to correct for censoring that is missing at random conditional on the variables in the fitted model. The formula used to generate the unstabilized IPCW is
\[\pi_i(t) = \prod_{R_k \le t} \frac{1}{\Pr(C_i > R_k | \bar{L} = \bar{l}, C_i > R_{k-1})}\]
The stabilized IPCW substitutes predicted probabilities under the specified numerator model into the numerator of the previous equation. In general, it is recommended to stabilize IPCW by time.
\[\pi_i(t) = \prod_{R_k \le t} \frac{\Pr(C_i > R_k)}{\Pr(C_i > R_k | \bar{L} = \bar{l}, C_i > R_{k-1})}\]
Note
IPCW no longer supports late entry, because the pooled logistic regression approach does not correctly accumulate the weights. As such, either all late entries need to be dropped (the new-user design) or rows need to be back-propagated (unobserved rows are filled in). The second approach requires filling in the unobserved covariates, which for time-varying variables will require imputation. The new-user design is the safer bet and is what I currently recommend.
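The two weight formulas above amount to a running product of per-interval ratios for each individual. The following is a minimal sketch of that arithmetic in plain Python, not zepid's implementation: the function name `cumulative_ipcw` and the probabilities are hypothetical, and in practice the probabilities would come from the fitted numerator and denominator models.

```python
from itertools import accumulate
from operator import mul


def cumulative_ipcw(p_denominator, p_numerator=None):
    """Cumulative weights for one individual's time-ordered intervals.

    p_denominator: predicted Pr(uncensored in interval k | covariates,
                   uncensored through interval k-1), one value per interval.
    p_numerator:   predicted probabilities from the numerator model, or None
                   for unstabilized weights (numerator fixed at 1).
    """
    if p_numerator is None:
        p_numerator = [1.0] * len(p_denominator)
    # Per-interval ratio, then a running product over intervals R_k <= t
    ratios = [n / d for n, d in zip(p_numerator, p_denominator)]
    return list(accumulate(ratios, mul))


# Hypothetical predicted probabilities for one individual over three intervals
denominator = [0.5, 0.8, 0.5]
numerator = [0.5, 0.4, 0.5]

unstabilized = cumulative_ipcw(denominator)
stabilized = cumulative_ipcw(denominator, numerator)
```

With these made-up inputs the unstabilized weights grow quickly across intervals while the stabilized weights stay near one, which is the usual motivation for stabilizing.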
Parameters:
- df (DataFrame) – Pandas DataFrame object containing all the variables of interest
- idvar (str) – String that indicates the column name for a unique identifier for each individual
- time (str) – Column name for the ending observation time
- event (str) – Column name for the event of interest
- flat_df (bool, optional) – Whether the input dataframe only contains a single row per participant. If so, the flat dataframe is converted to a long dataframe. Default is False (for multiple rows per person)
- enter (str, optional) – Time participant began being observed. Default is None. This option is only needed when flat_df=True. Late-entries are no longer supported and specifying this will lead to a ValueError
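Since converting a flat data set yourself is the recommended route, the sketch below shows one way such a conversion could look. It is an illustration in plain Python under stated assumptions, not the conversion zepid performs when flat_df=True: the helper `flat_to_long`, the unit interval width, and the column names `t`, `enter`, `out`, and `event` are all hypothetical, and every individual is assumed to enter at time zero (no late entries).

```python
def flat_to_long(rows, step=1):
    """Expand one-row-per-person records into one row per interval.

    rows: list of dicts with keys 'id', 't' (end of follow-up), 'event'.
    step: interval width (an assumption of this sketch).
    Each output row covers the interval (enter, out]; the event indicator
    is carried only by the individual's final interval.
    """
    long_rows = []
    for row in rows:
        enter = 0  # assumes follow-up starts at time zero for everyone
        while enter < row['t']:
            out = min(enter + step, row['t'])
            is_last = out == row['t']
            long_rows.append({
                'id': row['id'],
                'enter': enter,
                'out': out,
                'event': row['event'] if is_last else 0,
            })
            enter = out
    return long_rows


# Two hypothetical individuals: one with the event at t=3, one censored at t=1.5
flat = [
    {'id': 1, 't': 3, 'event': 1},
    {'id': 2, 't': 1.5, 'event': 0},
]
long_format = flat_to_long(flat)
```

Whatever conversion is used, a quick check that the number of individuals and the number of events are unchanged is a reasonable counterpart to the comparison statistics mentioned above.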
Example
Setting up the environment
>>> from zepid import load_sample_data
>>> from zepid.causal.ipw import IPCW
>>> df = load_sample_data(timevary=True)
>>> df['enter_q'] = df['enter'] ** 2
>>> df['enter_c'] = df['enter'] ** 3
>>> df['age0_q'] = df['age0'] ** 2
>>> df['age0_c'] = df['age0'] ** 3
Calculating stabilized IPCW with a long data set
>>> ipc = IPCW(df, idvar='id', time='enter', event='dead')
>>> ipc.regression_models(model_denominator='enter + enter_q + enter_c + male + age0 + age0_q + age0_c',
...                       model_numerator='enter + enter_q + enter_c')
>>> ipc.fit()
Extracting calculated stabilized IPCW
>>> ipc.Weight
Calculating stabilized IPCW with a wide data set
>>> df = load_sample_data(False)
>>> ipc = IPCW(df, idvar='id', time='t', event='dead', flat_df=True)
>>> ipc.regression_models(model_denominator='enter + enter_q + enter_c + male + age0 + age0_q + age0_c',
...                       model_numerator='enter + enter_q + enter_c')
>>> ipc.fit()
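Once weights have been extracted, they can be applied in a weighted Kaplan-Meier estimator, as mentioned in the description. The following is a minimal plain-Python sketch of that estimator, assuming long-format rows of (enter, out, event, weight); the function `weighted_kaplan_meier` and the data are hypothetical and are not part of zepid.

```python
def weighted_kaplan_meier(rows):
    """Weighted Kaplan-Meier survival estimates at each event time.

    rows: list of (enter, out, event, weight) tuples, one per
          person-interval, where the interval is (enter, out].
    Returns a list of (time, survival) pairs ordered by time.
    """
    event_times = sorted({out for _, out, event, _ in rows if event == 1})
    survival = 1.0
    curve = []
    for t in event_times:
        # Weighted number at risk: intervals that cover time t
        at_risk = sum(w for enter, out, _, w in rows if enter < t <= out)
        # Weighted number of events occurring exactly at time t
        events = sum(w for _, out, event, w in rows if out == t and event == 1)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical data: four individuals followed over two unit intervals.
rows = [
    (0, 1, 0, 1.0), (1, 2, 1, 1.0),  # A: event at t=2
    (0, 1, 0, 1.0),                  # B: censored at t=1
    (0, 1, 0, 1.0), (1, 2, 0, 2.0),  # C: event-free; up-weighted after B is censored
    (0, 1, 1, 1.0),                  # D: event at t=1
]
curve = weighted_kaplan_meier(rows)
```

In this toy data set, individual C's second interval carries a weight of 2 so that C also stands in for the censored individual B, which changes the risk set at t=2 from 2 to 3.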
References
Howe CJ et al. (2016) Selection bias due to loss to follow up in cohort studies. Epidemiology, 27(1), 91-97.
__init__(df, idvar, time, event, flat_df=False, enter=None)
Initialize self. See help(type(self)) for accurate signature.
Methods
- __init__(df, idvar, time, event[, flat_df, …]) – Initialize self.
- fit() – Calculates IPCW for each observation period for each observation.
- regression_models(model_denominator, …[, …]) – Regression model to generate predicted probabilities of censoring, conditional on specified variables.