zepid.causal.ipw.IPMW.IPMW

class zepid.causal.ipw.IPMW.IPMW(df, missing_variable, stabilized=False, monotone=True)

Calculates inverse probability of missingness weights (IPMW). IPMW automatically codes a missingness indicator from np.nan values, so data can be input directly without first creating a missingness indicator.

The formula for stabilized IPMW is

\[\pi_i = \frac{\Pr(M=0)}{\Pr(M=0|L=l)}\]

where M=0 indicates observed data. The formula for unstabilized IPMW is

\[\pi_i = \frac{1}{\Pr(M=0|L=l)}\]
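The two formulas can be worked through by hand on a toy data set. The following is a minimal sketch, not the zepid implementation: it assumes a single binary covariate `L` and estimates Pr(M=0|L=l) as the observed proportion within each level of `L`, in place of the fitted logistic regression that IPMW uses. The column names `L` and `Y`, and the choice to leave weights as np.nan for unobserved rows, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: 'Y' has missing values (np.nan); 'L' is a fully observed binary covariate
df = pd.DataFrame({
    "L": [0, 0, 0, 0, 1, 1, 1, 1],
    "Y": [1.0, 0.0, 1.0, np.nan, 0.0, np.nan, np.nan, 1.0],
})

# Indicator of being observed (M=0 in the notation above)
observed = df["Y"].notna().astype(int)

# Pr(M=0 | L=l): proportion observed within each level of L
p_obs_given_l = observed.groupby(df["L"]).transform("mean")

# Pr(M=0): marginal proportion observed
p_obs = observed.mean()

# Weights are assigned to observed rows only; unobserved rows are left as np.nan
unstabilized = np.where(observed == 1, 1 / p_obs_given_l, np.nan)
stabilized = np.where(observed == 1, p_obs / p_obs_given_l, np.nan)

print(unstabilized)
print(stabilized)
```

In this sketch the unstabilized weights of the observed rows sum to the full sample size (8), while the stabilized weights sum to the number of observed rows (5), which is the usual motivation for stabilization.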

IPMW currently supports weights for a single missing variable, or for a list of variables that are monotonically missing. Data are missing monotonically when the variables with missing data can be ordered such that each variable can only be observed if the preceding variable was observed. A simple example is censoring in longitudinal data without late entry: to be observed at time t, an individual must have been observed at time t-1.
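The monotone condition can be checked directly on a data set before weights are requested. The sketch below is illustrative only: `is_monotone_missing` is a hypothetical helper, not part of zepid, and it checks one user-supplied ordering of the columns rather than searching over all orderings. The column names `t1`, `t2`, `t3` are assumed visit times.

```python
import numpy as np
import pandas as pd


def is_monotone_missing(df, ordered_columns):
    """Hypothetical helper: True if, for the given ordering, every row that is
    observed in a later column is also observed in the preceding column."""
    observed = df[ordered_columns].notna()
    for earlier, later in zip(ordered_columns[:-1], ordered_columns[1:]):
        # A violation is a row where the later variable is observed
        # but the earlier variable is missing
        if (observed[later] & ~observed[earlier]).any():
            return False
    return True


# Longitudinal toy data: once a visit is missed, all later visits are missed
visits = pd.DataFrame({
    "t1": [1.0, 1.0, 1.0, np.nan],
    "t2": [1.0, 1.0, np.nan, np.nan],
    "t3": [1.0, np.nan, np.nan, np.nan],
})
monotone = is_monotone_missing(visits, ["t1", "t2", "t3"])

# Late entry (observed at t3 after missing t1 and t2) breaks the pattern
late_entry = visits.copy()
late_entry.loc[3, "t3"] = 1.0
not_monotone = is_monotone_missing(late_entry, ["t1", "t2", "t3"])

print(monotone, not_monotone)
```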

For multiple variables with missing data, IPMW checks whether the variables are uniformly missing (that is, missing for the same observations). This is a special case of monotonic missing data, in which IPMW only needs to calculate weights for one of the variables. See the references for further details.
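One way to picture uniform missingness is to compare the missingness patterns of two columns row by row. This is a sketch under the assumption that "uniform" means the two variables are missing for exactly the same observations; how IPMW performs the check internally is not shown here. The column names `A`, `B`, `C` are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan, 0.0, np.nan],
    "B": [5.0, np.nan, 3.0, np.nan],
    "C": [2.0, 4.0, np.nan, np.nan],
})

# A and B are missing in exactly the same rows, so a single set of weights
# would account for the missingness in both
uniform_ab = bool((df["A"].isna() == df["B"].isna()).all())

# A and C have different missingness patterns
uniform_ac = bool((df["A"].isna() == df["C"].isna()).all())

print(uniform_ab, uniform_ac)
```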

Parameters:
  • df (DataFrame) – Pandas DataFrame object containing all variables of interest
  • missing_variable (str, list) – Column name of the variable with missing data. Missing observations should be coded as numpy.nan. For multiple missing variables, pass a list of column names
  • stabilized (bool, optional) – Whether to return the stabilized or unstabilized IPMW. Default is False, which returns unstabilized weights
  • monotone (bool, optional) – Whether missing data is monotonic or nonmonotonic. Only used when multiple missing variables are provided. monotone=False currently raises an error

Note

Nonmonotonic missing data is arguably more common in practice. Sun and Tchetgen Tchetgen recently proposed a way to estimate IPMW under nonmonotonic missing data. I plan on implementing this in a future release. Until then, IPMW only supports monotonic missing data.

Examples

Setting up the environment

>>> from zepid import load_sample_data, load_monotone_missing_data
>>> from zepid.causal.ipw import IPMW
>>> df = load_sample_data(timevary=False)

Calculating unstabilized Inverse Probability of Missingness Weights

>>> ipm = IPMW(df, missing_variable='dead', stabilized=False)
>>> ipm.regression_models(model_denominator='age0 + art + male')
>>> ipm.fit()

Extracting calculated weights

>>> ipm.Weight

Calculating IPMW for monotone missing variables

>>> df = load_monotone_missing_data()
>>> ipm = IPMW(df, missing_variable=['B', 'C'], monotone=True)
>>> ipm.regression_models(model_denominator=['L + A', 'L + B'])
>>> ipm.fit()
>>> ipm.Weight

References

Sun B, et al. (2017). Inverse-probability-weighted estimation for monotone and nonmonotone missing data. American Journal of Epidemiology, 187(3), 585-591.

Perkins NJ, et al. (2017). Principled approaches to missing data in epidemiologic studies. American Journal of Epidemiology, 187(3), 568-575.

Li L, Shen C, Li X, Robins JM. (2013). On weighting approaches for missing data. Statistical Methods in Medical Research, 22(1), 14-30.

Greenland S, Finkle WD. (1995). A critical look at methods for handling missing covariates in epidemiologic regression analyses. American Journal of Epidemiology, 142(12), 1255-1264.

Seaman SR, White IR. (2013). Review of inverse probability weighting for dealing with missing data. Statistical Methods in Medical Research, 22(3), 278-295.

__init__(df, missing_variable, stabilized=False, monotone=True)

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__(df, missing_variable[, stabilized, …]) Initialize self.
fit() Calculates the IPMW based on the predicted probabilities from the fitted logistic regression models.
regression_models(model_denominator[, …]) Regression model to generate predicted probabilities of missingness, conditional on specified variables.