zepid.superlearner.stackers.SuperLearner

class zepid.superlearner.stackers.SuperLearner(estimators, estimator_labels, folds=10, loss_function='L2', solver='nnls', bounds=1e-06, discrete=False, verbose=False)

SuperLearner is an implementation of the super learner algorithm, a generalized stacking algorithm. Super learner combines multiple predictive functions into a single predictive function whose performance is, asymptotically, at least as good as that of the best candidate estimator included. Additionally, super learner converges at the rate at which the best candidate estimator converges.

Briefly, super learner takes a set of candidate estimators as input. Each estimator is run through a train-test cross-validation procedure. From the candidates, either the single best-performing estimator (discrete super learner) or a weighted combination of all the estimators is used as the final predictive function.
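A rough sketch of this procedure, assuming estimators, X, and y like those in the Examples below (illustration only, not zepid's internal implementation):

>>> # Sketch: cross-validated predictions -> non-negative weights -> combined predictor
>>> import numpy as np
>>> from scipy.optimize import nnls
>>> from sklearn.model_selection import cross_val_predict
>>> Z = np.column_stack([cross_val_predict(est, X, y, cv=10) for est in estimators])
>>> w, _ = nnls(Z, y)       # weights minimizing the cross-validated squared error
>>> w = w / np.sum(w)       # normalize the weights to sum to one
>>> fits = np.column_stack([est.fit(X, y).predict(X) for est in estimators])
>>> y_sl = fits @ w         # super learner prediction as the weighted combination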

Note

SuperLearner does not accept missing data. All missing data decisions have to occur prior to trying to use the SuperLearner procedure.

SuperLearner accepts estimators that are of the SciKit-Learn format. Specifically, the candidate estimators must follow the estimator.fit(X, y) and estimator.predict(X) format. Performance has currently been checked for sklearn, pygam, and the estimators included in zepid.superlearner. Please consider opening an issue on GitHub if you find Python libraries that are not supported (but follow the SciKit-Learn style).
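For illustration, a minimal candidate needs only those two methods. Below is a hypothetical sketch in the spirit of zepid's EmpiricalMeanSL (any custom class should still be verified before use):

>>> import numpy as np
>>> class MedianSL:
...     """Hypothetical candidate: predicts the marginal median of y"""
...     def fit(self, X, y):
...         self.median_ = np.median(y)  # ignores X entirely
...         return self
...     def predict(self, X):
...         return np.full(X.shape[0], self.median_)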

Note

SuperLearner(discrete=True) returns predictions from the candidate estimator with the greatest coefficient. In the case of a tie, the first candidate estimator with the greatest coefficient is used (as per numpy.argmax behavior).
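This tie-breaking behavior can be verified directly:

>>> import numpy as np
>>> np.argmax([0.4, 0.4, 0.2])  # tied maxima resolve to the first index
0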

To make comparisons easy, SuperLearner provides both the Cross-Validated Error and the Relative Efficiency. The Cross-Validated Error calculation depends on the chosen loss function. For L2, the loss function is

\[\frac{1}{n} \sum_i (Y_i - \widehat{Y}_i)^2\]

For the negative log-likelihood (NLogLik) loss function,

\[-\frac{1}{n} \sum_i \left[ Y_i \times \ln(\widehat{Y}_i) + (1-Y_i) \times \ln(1 - \widehat{Y}_i) \right]\]

Relative efficiency is the Cross-Validated Error for the candidate estimator divided by the Cross-Validated Error for the chosen super learner.
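Written out as a short sketch (the names y, y_pred, candidate_pred, and sl_pred are hypothetical placeholders for observed outcomes and bounded predictions):

>>> import numpy as np
>>> def l2_error(y, y_pred):
...     return np.mean((y - y_pred) ** 2)
>>> def nloglik_error(y, y_pred):
...     return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
>>> # Relative efficiency of a candidate: its CV error over the super learner's
>>> # relative_efficiency = l2_error(y, candidate_pred) / l2_error(y, sl_pred)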

Parameters:
  • estimators (list, array) – Candidate estimators. Must follow sklearn style and not be fit yet
  • estimator_labels (list, array) – Labels for the candidate estimators being included
  • folds (int, optional) – Number of folds to use during the cross-validation procedure. Between 10 and 20 folds is recommended. The default is 10-fold cross-validation.
  • loss_function (str, optional) – Loss function to use. Options include: L2, NLogLik. L2 should be used for continuous outcomes and NLogLik for binary outcomes
  • solver (str, optional) – Optimization algorithm used to determine the super learner weights. Currently only non-negative least squares ('nnls', the default) is available.
  • bounds (float, collection, optional) – Bounds to apply to predicted probabilities. The bounding prevents predictions of exactly 0 or 1, which would break the loss function evaluation; see the sketch after this list. Default is 1e-6.
  • discrete (bool, optional) – Whether to use only the estimator with the greatest weight (discrete super learner). Default is False, which uses the super learner including all estimators
  • verbose (bool, optional) – Whether to print progress to the console as super learner is being fit. Default is False.
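To illustrate what the bounding accomplishes, a sketch using np.clip (an assumption about the mechanics, not a statement about zepid's internals):

>>> import numpy as np
>>> y_pred = np.array([0.0, 0.25, 1.0])            # raw predicted probabilities
>>> y_bounded = np.clip(y_pred, 1e-6, 1 - 1e-6)    # now strictly inside (0, 1)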

Examples

Set up the environment and the data set

>>> import numpy as np
>>> import statsmodels.api as sm
>>> from sklearn.linear_model import LinearRegression, LogisticRegression
>>> from zepid import load_sample_data
>>> from zepid.superlearner import EmpiricalMeanSL, StepwiseSL, SuperLearner
>>> fb = sm.families.family.Binomial()
>>> fc = sm.families.family.Gaussian()
>>> df = load_sample_data(False).dropna()
>>> X = np.asarray(df[['art', 'male', 'age0']])
>>> y = np.asarray(df['dead'])

SuperLearner for binary outcomes

>>> # Setting up estimators
>>> emp = EmpiricalMeanSL()
>>> log = LogisticRegression()
>>> step = StepwiseSL(family=fb, selection="backward", order_interaction=1)
>>> sl = SuperLearner(estimators=[emp, log, step], estimator_labels=["Mean", "Log", "Step"], loss_function='nloglik')
>>> fsl = sl.fit(X, y)
>>> fsl.summary()  # Summary of Cross-Validated Errors
>>> fsl.predict(X)  # Generating predicted values from super learner

SuperLearner for continuous outcomes

>>> emp = EmpiricalMeanSL()
>>> lin = LinearRegression()
>>> step = StepwiseSL(family=fc, selection="backward", order_interaction=1)
>>> y = np.asarray(df['cd40'])  # a continuous outcome for the L2 loss
>>> sl = SuperLearner(estimators=[emp, lin, step], estimator_labels=["Mean", "Lin", "Step"], loss_function='L2')
>>> fsl = sl.fit(X, y)
>>> fsl.summary()  # Summary of Cross-Validated Errors
>>> fsl.predict(X)  # Generating predicted values from super learner

Discrete Super Learner

>>> sl = SuperLearner([emp, lin, step], ["Mean", "Lin", "Step"], loss_function='L2', discrete=True)
>>> sl.fit(X, y)

References

Van der Laan MJ, Polley EC, Hubbard AE. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1).

Rose S. (2013). Mortality risk score prediction in an elderly population using machine learning. American Journal of Epidemiology, 177(5), 443-452.

Methods

  • fit(X, y) – Fit SuperLearner given the variables X to predict y.
  • predict(X) – Generate predictions using the fit SuperLearner.
  • summary() – Print the summary information for the fit SuperLearner to the console.

fit(X, y)

Fit SuperLearner given the variables X to predict y. These variables are passed directly to the candidate estimators. Any pre-processing outside of the estimators should be done before passing to fit; see the sketch after this method's description.

Parameters:
  • X (numpy.array) – Covariates to predict the target values
  • y (numpy.array) – Target values to predict
Returns: The fit SuperLearner (supporting the fsl = sl.fit(X, y) pattern shown in the examples above)
Return type: SuperLearner
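For example, if the covariates should be standardized, that transformation happens before the call (StandardScaler is one hypothetical pre-processing choice, not a zepid requirement):

>>> from sklearn.preprocessing import StandardScaler
>>> X_std = StandardScaler().fit_transform(X)  # pre-processing outside the estimators
>>> fsl = sl.fit(X_std, y)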

predict(X)

Generate predictions using the fit SuperLearner.

Parameters:
  • X (numpy.array) – Covariates to generate predictions of y. Note that X should be in the same format as the X used during the fit() function.
Returns: Predicted values using either the discrete super learner or the super learner
Return type: numpy.array

summary()

Prints the summary information for the fit SuperLearner to the console.

Return type: None