# Graphics¶

This page demonstrates some of the different graphics possible to generate with zEpid. These plots are all generated using matplotlib. The functions themselves return matplotlib.axes objects, so users can further edit and customize their plots. Let’s look at some of the plots

## Functional Form Assessment¶

zEpid makes graphical (qualitative) and statistical (quantitative) functional form assessment easy to implement. Functional form assessments are available for either discrete or continuous variables. The distinction only matters for the calculation of the LOESS curve generated in the plots.

Plots and regression model results come from generalized linear models fit with statsmodels.

Let’s look at some examples. We will look at baseline age (discrete variable). We will compare linear, quadratic, and splines for the functional form. First, we set up the data

import zepid as ze
from zepid.graphics import functional_form_plot

df['age0_sq'] = df['age0']**2
df[['rqs0', 'rqs1']] = ze.spline(df, var='age0', term=2, n_knots=3, knots=[30, 40, 55], restricted=True)


### Linear¶

Now that our variables are all prepared, we will look at a basic linear term for age0.

functional_form_plot(df, outcome='dead', var='age0', discrete=True)
plt.show()


In the console, the following results will be printed

Warning: missing observations of model variables are dropped
0  observations were dropped from the functional form assessment
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                   dead   No. Observations:                  547
Model:                            GLM   Df Residuals:                      545
Model Family:                Binomial   Df Model:                            1
Method:                          IRLS   Log-Likelihood:                -239.25
Date:                Tue, 26 Jun 2018   Deviance:                       478.51
Time:                        08:25:47   Pearson chi2:                     553.
No. Iterations:                     5   Covariance Type:             nonrobust
==============================================================================
coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -3.6271      0.537     -6.760      0.000      -4.679      -2.575
age0           0.0507      0.013      4.012      0.000       0.026       0.075
==============================================================================
AIC:  482.50783872152573
BIC:  -2957.4167585984537


In the image, the blue line corresponds to the regression line and the shaded blue region is the 95% confidence intervals. The red-dashed line is the statsmodels generated LOESS curve. We can also have the data points that the LOESS curve is fit to plot as well

To implement other functional forms besides linear terms, the optional f_form argument must be supplied. Note that any terms specified in the f_form argument must be part of the data set. We can assess a quadratic functional form like the following

functional_form_plot(df, outcome='dead', var='age0', f_form='age0 + age0_sq', discrete=True)
plt.show()


The f_form argument is used to specify any functional form variables that are coded by the user

### Spline¶

We will now compare using restricted quadratic splines for the functional form of age. To show how users can further edit the plot, we will add dashed lines to designate where the spline knots are located

functional_form_plot(df, outcome='dead', var='age0', f_form='age0 + rqs0 + rqs1', discrete=True)
plt.vlines(30, 0, 0.85, colors='gray', linestyles='--')
plt.vlines(40, 0, 0.85, colors='gray', linestyles='--')
plt.vlines(55, 0, 0.85, colors='gray', linestyles='--')
plt.show()


### Continuous Variables¶

For non-discrete variables (indicated by discrete=False, the default), data is binned into categories automatically. The number of categories is determined via the maximum value minus the minimum divided by 5.

$\frac{(max(X) - min(X))}{5}$

To adjust the number of categories, the continuous variable can be multiplied by some constant. If more categories are desired, then the continuous variable can be multiplied by some constant greater than 1. Conversely, if less categories are desired, then the continuous variable can be multiplied by some constant between 0 and 1. In this example we will look at cd40 which corresponds to baseline viral load.

functional_form_plot(df, outcome='dead', var='cd40')
plt.show()


If we use the current values, the number of categories is indicated in the console output as

A total of 99 categories were created. If you would like to influence the number of categories
the spline is fit to, do the following
Increase: multiply by a constant >1
Decrease: multiply by a constant <1 and >0


We can see that statsmodels has an overflow issue in some exponential. We can decrease the number of categories within cd40 to see if that fixes this. We will decrease the number of categories by multiplying by 0.25.

df['cd4_red'] = df['cd40']*0.25
plt.show()


Now only 24 categories are created and it removes the overflow issue.

## P-value Plot¶

As described and shown in Epidemiology 2nd Edition by K. Rothman, this function is meant to plot the p-value distribution for a variable. From this distribution, p-values and confidence intervals can be visualized to compare or contrast results. Note that this functionality only works for linear variables (i.e. Risk Difference and log(Risk Ratio)). Returning to our results from the Measures section, we will look at plots of the Risk Difference. For a risk difference of -0.049 (SE: 0.042), the plot is

from zepid.graphics import pvalue_plot

pvalue_plot(point=-0.049, sd=0.042)
plt.show()


We can stack multiple p-value plots together. Imagine a systematic review was conducted prior to our study and resulted in a summary risk difference of -0.062 (SE: 0.0231). We can use the p-value plots to visually display the results of our data and the systematic review

from matplotlib.lines import Line2D

pvalue_plot(point=-0.049, sd=0.042, color='b', fill=False)
pvalue_plot(point=-0.062, sd=0.0231, color='r', fill=False)
plt.legend([Line2D([0], [0], color='b', lw=2),
Line2D([0], [0], color='r', lw=2)],
['Our Study', 'Review'])
plt.show()


## Spaghetti Plot¶

Spaghetti plots are a fun (sometimes useful) way to look for outliers/patterns in longitudinal data. The following is an example spaghetti plot using the longitudinal data from zEpid and looking at CD4 T cell count over time.

df = ze.load_sample_data(timevary=True)
ze.graphics.spaghetti_plot(df,idvar='id',variable='cd4',time='enter')
plt.show()


NOTE If your data set is particularly large, a spaghetti plot may take a long time to generate and may not be useful as a visualization. They are generally easiest to observe with a smaller number of participants. However, they can be useful for finding extreme outliers in large data sets.

## Effect Measure Plots¶

Effect measure plots are also referred to as forest plots. Forest plots generally summarize the of various studies and collapse the studies into a single summary measure. Effect measure plots are similar but do not use the same summary measure. For an example, I am going to replicate Figure 2 from my 2017 paper “Influenza vaccination status and outcomes among influenza-associated hospitalizations in Columbus, Ohio (2012-2015)” published in Epidemiology and Infection

The first step to creating the effect measure plot is to create lists containing; labels, point estimates, lower confidence limits, and upper confidence limits

import numpy as np
from zepid.graphics import EffectMeasurePlot

measure = [np.nan, 0.94, np.nan, np.nan, 1.22, np.nan, np.nan, 0.59, np.nan, np.nan, 1.09]
lower = [np.nan, 0.77, np.nan, np.nan, '0.80', np.nan, np.nan, '0.40', np.nan, np.nan, 0.83]
upper = [np.nan, 1.15, np.nan, np.nan, 1.84, np.nan, np.nan, 0.85, np.nan, np.nan, 1.44]


Some general notes about the above code: (1) for blank y-axis labels, a blank string is indicated, (2) for blank measure/confidence intervals, np.nan is specified, (3) for floats ending with a zero, they must be input as characters. If floats that end in 0 (such as 0.80) are put into a list as a string and not a float, the floating 0 will be truncated from the table. Now that our data is all prepared, we can now generate our plot

p = EffectMeasurePlot(label=labs, effect_measure=measure, lcl=lower, ucl=upper)
p.labels(scale='log')
plt.tight_layout()
plt.show()


There are other optional arguments to adjust the plot (colors of points/point shape/etc.). Take a look through the Reference page for available options

NOTE There is one part of the effect measure plot that is not particularly pretty. In the plot() function there is an optional argument t_adjuster. This argument changes the alignment of the table so that the table aligns properly with the plot values. I have NOT figured out a way to do this automatically. Currently, t_adjuster must be changed by the user manually to find a good table alignment. I recommend using changes of 0.01 in t_adjuster until a good alignment is found. Additionally, sometimes the plot will be squished. To fix this, the plot size can be changed by the figsize argument

Receiver-Operator Curves (ROC) are a fundamental tool for diagnosing the sensitivity and specificity of a test over a variety of thresholds. ROC curves can be generated for predicted probabilities from a model or different diagnostics thresholds (ex. ALT levels to predict infections). In this example, we will predict the probability of death among the sample data set. First, we will need to get some predicted probabilities. We will use statsmodels to build a simple predictive model and obtain predicted probabilities.

import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from zepid.graphics import roc

df['age0_sq'] = df['age0']**2
df['cd40sq'] = df['cd40']**2
model = 'dead ~ art + age0 + age0_sq + cd40 + cd40sq + dvl0 + male'
m = smf.glm(model, df, family=f).fit()
df['predicted'] = m.predict(df)


Now with predicted probabilities, we can generate a ROC plot

roc(df.dropna(), true='dead', threshold='predicted')
plt.tight_layout()
plt.show()


Which generates the following plot. For this plot the Youden’s Index is also calculated by default. The following output is printed to the console

----------------------------------------------------------------------
Youden's Index:  0.15328818469754796
Predictive values at Youden's Index
Sensitivity:  0.6739130434782609
Specificity:  0.6857142857142857
----------------------------------------------------------------------


Youden’s index is the solution to the following

$Sensitivity + Specificity - 1$

where Youden’s index is the value that maximizes the above. Basically, it maximizes both sensitivity and specificity. You can learn more from HERE

## Dynamic Risk Plots¶

Dynamic risk plots allow the visualization of how the risk difference/ratio changes over time. For a published example, see HERE and discussed further HERE

For this example, we will borrow our results from our IPTW marginal structural model. We will use the fitted survival functions to obtain the risk estimates for our exposed and unexposed groups. These were generated from the lifelines Kaplan Meier curves (estimated via KaplanMeierFitter).

a = 1 - kme.survival_function_
b = 1 - kmu.survival_function_
dynamic_risk_plot(a, b)
plt.show()


By default, the function returns the risk difference plot. You can also request a risk ratio plot (and with different colors).

dynamic_risk_plot(a, b, measure='RR', point_color='darkred', line_color='r', scale='log')
plt.yticks([0.4, 0.6, 0.8, 1, 2, 4, 6])
plt.show()


The log-transformed risk ratio is also available

dynamic_risk_plot(a, b, measure='RR', point_color='darkgreen', line_color='g', scale='log-transform')
plt.show()


## L’Abbe Plots¶

L’Abbe plots have generally been use to display meta-analysis results. However, I also find them to be a useful tool to explain effect/association measure modification on the additive or the multiplicative scales. Furthermore, it visually demonstrates that when there is a non-null average causal effect, then there must be modification on at least one scale.

To generate a L’Abbe plot, you can use the labbe_plot() function. Below is example code to generate an empty L’Abbe plot.

from zepid.graphics import labbe_plot

labbe_plot()
plt.show()


In this plot, you are presented lines that indicate where stratified measures would need to lie on for there to be no additive / multiplicative interaction. By default, both the additive and multiplicative plots are presented. Let’s look at an example with some data

from zepid.graphics import labbe_plot

labbe_plot(r1=[0.3, 0.5], r0=[0.2, 0.7], color='red')
plt.show()


As seen in the plot, there is both additive and multiplicative interaction. As would be described by Hernan, Robins, and others, there is qualitative modification (estimates are on opposite sides of the null, the dashed-line). Let’s look at one more example,

from zepid.graphics import labbe_plot

labbe_plot(r1=[0.25, 0.5], r0=[0.1, 0.2], color='red')
plt.show()


In this example, there is additive modification, but no multiplicative modification. These plots also can have the number of reference lines displayed changed, and support the keyword arguments of plt.plot() function. See the function documentation for further details.

## Zipper Plot¶

Zipper plots provide an easy way to visualize the performance of confidence intervals in simulations. Confidence intervals across simulations are displayed in a single plot, with the option to color the confidence limits by whether they include the true value. Below is an example of a zipper plot. For ease, I generated the confidence intervals using some random numbers (you would pull the confidence intervals from the estimators in practice).

from zepid.graphics import zipper_plot
lower = np.random.uniform(-0.1, 0.1, size=100)
upper = lower + np.random.uniform(0.1, 0.2, size=100)

zipper_plot(truth=0,
lcl=lower,
ucl=upper,
colors=('blue', 'green'))
plt.show()


In this example, confidence interval coverage would be considered rather poor (if we are expecting the usual 95% coverage).