zepid.superlearner.stackers.SuperLearner¶

class zepid.superlearner.stackers.SuperLearner(estimators, estimator_labels, folds=10, loss_function='L2', solver='nnls', bounds=1e-06, discrete=False, verbose=False)¶

SuperLearner is an implementation of the super learner algorithm, a generalized stacking algorithm. Super learner combines multiple predictive functions into a single predictive function whose performance is (asymptotically) at least as good as that of the best candidate estimator included. Additionally, it should be noted that super learner converges at the rate with which the best candidate estimator converges.
Briefly, super learner takes an input of candidate estimators for a function. Each of the estimators is run through a train-test cross-validation algorithm. From the candidate estimators, either the best overall performing candidate (discrete super learner) or a weighted combination of the algorithms is used as the updated predictive function.
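The weighted-combination step above can be sketched in a few lines. This is an illustration of the generalized stacking idea only, not zepid's internal implementation: it assumes an L2 loss, scikit-learn-style candidates, and non-negative least squares (via scipy) for the weights.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Simulated data for the sketch (not zepid's sample data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=200)

estimators = [LinearRegression(), DecisionTreeRegressor(max_depth=3)]

# Cross-validated predictions: each observation is predicted only by
# models that never saw it during fitting
Z = np.zeros((len(y), len(estimators)))
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    for j, est in enumerate(estimators):
        est.fit(X[train], y[train])
        Z[test, j] = est.predict(X[test])

# Non-negative least squares on the cross-validated predictions gives the
# candidate weights; normalizing to sum to one yields the combination
weights, _ = nnls(Z, y)
weights = weights / weights.sum()

# Refit each candidate on the full data; the super learner prediction is
# the weighted combination of the candidate predictions
for est in estimators:
    est.fit(X, y)
sl_pred = np.column_stack([est.predict(X) for est in estimators]) @ weights
```

The discrete super learner corresponds to replacing the weighted combination with the single candidate receiving the largest weight.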
Note
SuperLearner does not accept missing data. All missing data decisions must be made prior to using the SuperLearner procedure.
SuperLearner accepts estimators that follow the scikit-learn format. Specifically, the candidate estimators must implement the estimator.fit(X, y) and estimator.predict(X) interface. Performance has currently been checked for sklearn, pygam, and the estimators included in zepid.superlearner. Please consider opening an issue on GitHub if you find Python libraries that are not supported (but follow the scikit-learn style).
Note
SuperLearner(discrete=True) returns predictions from the candidate estimator with the greatest coefficient. In the case of a tie, the first candidate estimator with the greatest coefficient is used (as per numpy.argmax behavior).
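The tie-breaking behavior described above follows directly from numpy; a quick check:

```python
import numpy as np

# numpy.argmax returns the index of the FIRST occurrence of the maximum,
# so a tie among candidate coefficients selects the earliest estimator
coefs = np.array([0.4, 0.4, 0.2])  # first two candidates are tied
chosen = np.argmax(coefs)
print(chosen)  # -> 0, the first of the tied candidates
```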
To compare performances easily, SuperLearner provides both the Cross-Validated Error and the Relative Efficiency. The Cross-Validated Error calculation depends on the chosen loss function. For L2, the loss function is
\[\frac{1}{n} \sum_i (Y_i - \widehat{Y}_i)^2\]
For the negative-log-likelihood loss function,
\[-\frac{1}{n} \sum_i \left[ Y_i \times \ln(\widehat{Y}_i) + (1 - Y_i) \times \ln(1 - \widehat{Y}_i) \right]\]
Relative efficiency is the Cross-Validated Error for the candidate estimator divided by the Cross-Validated Error for the chosen super learner.
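The two loss functions and the relative efficiency ratio can be written out directly. The helper names below are illustrative, not part of zepid's API, and the prediction arrays are made-up values:

```python
import numpy as np

def l2_error(y, y_hat):
    """L2 loss: (1/n) * sum((Y_i - Yhat_i)^2)."""
    return np.mean((y - y_hat) ** 2)

def nloglik_error(y, y_hat):
    """Negative log-likelihood loss for binary Y (y_hat are probabilities)."""
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Relative efficiency: candidate CV error over super learner CV error
y = np.array([1, 0, 1, 1, 0])
p_cand = np.array([0.8, 0.3, 0.6, 0.7, 0.2])  # a candidate's predictions
p_sl = np.array([0.9, 0.2, 0.7, 0.8, 0.1])    # super learner predictions
rel_eff = nloglik_error(y, p_cand) / nloglik_error(y, p_sl)
```

A relative efficiency above 1 indicates the candidate's cross-validated error exceeds the super learner's.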
Parameters:
- estimators (list, array) – Candidate estimators. Must follow the sklearn style and must not be fit yet.
- estimator_labels (list, array) – Labels for the candidate estimators being included.
- folds (int, optional) – Number of folds to use during the cross-validation procedure. A value between 10 and 20 is recommended. The default is 10-fold cross-validation.
- loss_function (str, optional) – Loss function to use. Options include: L2, NLogLik. L2 should be used for continuous outcomes and NLogLik for binary outcomes.
- solver (str, optional) – Optimization algorithm used to determine the super learner weights. Currently only Non-Negative Least Squares is available.
- bounds (float, collection, optional) – Bounding to use for predicted probabilities. The bounding prevents values of exactly 0 or 1, which would break the loss function evaluation. Default is 1e-6.
- discrete (bool, optional) – Whether to use only the estimator with the greatest weight (discrete super learner). Default is False, which uses the super learner including all estimators.
- verbose (bool, optional) – Whether to print progress to the console as the super learner is being fit. Default is False.
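The role of the bounds parameter can be seen in a short sketch. Here np.clip stands in for the bounding step (this is not zepid's internal code), and 1e-6 mirrors the documented default:

```python
import numpy as np

# Predicted probabilities of exactly 0 or 1 make the log-loss infinite,
# so they are truncated into [bound, 1 - bound] before evaluation
bound = 1e-6
raw = np.array([0.0, 0.25, 1.0])
clipped = np.clip(raw, bound, 1 - bound)

# log() on the clipped values is now finite for both terms of the loss
finite = np.isfinite(np.log(clipped)) & np.isfinite(np.log(1 - clipped))
```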
Examples
Setup the environment and data set
>>> import numpy as np
>>> import statsmodels.api as sm
>>> from sklearn.linear_model import LinearRegression, LogisticRegression
>>> from zepid import load_sample_data
>>> from zepid.superlearner import EmpiricalMeanSL, StepwiseSL, SuperLearner

>>> fb = sm.families.family.Gaussian()
>>> fc = sm.families.family.Binomial()
>>> df = load_sample_data(False).dropna()
>>> X = np.asarray(df[['art', 'male', 'age0']])
>>> y = np.asarray(df['dead'])
SuperLearner for binary outcomes
>>> # Setting up estimators
>>> emp = EmpiricalMeanSL()
>>> log = LogisticRegression()
>>> step = StepwiseSL(family=fc, selection="backward", order_interaction=1)
>>> sl = SuperLearner(estimators=[emp, log, step], estimator_labels=["Mean", "Log", "Step"], loss_function='nloglik')
>>> fsl = sl.fit(X, y)
>>> fsl.summary()  # Summary of Cross-Validated Errors
>>> fsl.predict(X)  # Generating predicted values from super learner
SuperLearner for continuous outcomes
>>> emp = EmpiricalMeanSL()
>>> lin = LinearRegression()
>>> step = StepwiseSL(family=fb, selection="backward", order_interaction=1)
>>> sl = SuperLearner(estimators=[emp, lin, step], estimator_labels=["Mean", "Lin", "Step"], loss_function='L2')
>>> fsl = sl.fit(X, y)
>>> fsl.summary()  # Summary of Cross-Validated Errors
>>> fsl.predict(X)  # Generating predicted values from super learner
Discrete Super Learner
>>> sl = SuperLearner([emp, lin, step], ["Mean", "Lin", "Step"], loss_function='L2', discrete=True)
>>> sl.fit(X, y)
References
Van der Laan MJ, Polley EC, Hubbard AE. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1).
Rose S. (2013). Mortality risk score prediction in an elderly population using machine learning. American Journal of Epidemiology, 177(5), 443-452.
Methods

fit(X, y) – Fit SuperLearner given the variables X to predict y.
predict(X) – Generate predictions using the fit SuperLearner.
summary() – Prints the summary information for the fit SuperLearner to the console.
fit(X, y)¶
Fit SuperLearner given the variables X to predict y. These variables are passed directly to the candidate estimators. If any preprocessing needs to happen outside of the estimators, do so before passing the data to fit.
Parameters:
- X (numpy.array) – Covariates used to predict the target values
- y (numpy.array) – Target values to predict

Return type: None
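Since fit() passes X straight through to the candidate estimators, any transformation happens beforehand. A minimal sketch, using scikit-learn standardization purely as an example of such preprocessing (SuperLearner itself does not require it):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Preprocess BEFORE calling SuperLearner.fit(X, y): here each covariate
# column is standardized to mean 0 and unit variance
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # this array would then be passed to fit()
```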

predict(X)¶
Generate predictions using the fit SuperLearner.

Parameters: X (numpy.array) – Covariates used to generate predictions of y. Note that X should be in the same format as the X used during fit().

Return type: numpy.array of predicted values from either the discrete super learner or the super learner

summary()¶
Prints the summary information for the fit SuperLearner to the console.

Return type: None