5. Chapter 7 - Models for Count Data
!pip install proplot
import warnings
import pandas as pd
import proplot as plot
import seaborn as sns
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrices
from scipy import stats
warnings.filterwarnings("ignore")
%pylab inline
plt.rcParams["axes.labelweight"] = "bold"
plt.rcParams["font.weight"] = "bold"
cancer_df = pd.read_csv("https://github.com/saketkc/pyFLGLM/blob/master/data/Cancer.tsv.gz?raw=true", compression="gzip", sep="\t")
cancer_df.head()
|   | time | histology | stage | count | risktime |
|---|------|-----------|-------|-------|----------|
| 0 | 1 | 1 | 1 | 9 | 157 |
| 1 | 1 | 2 | 1 | 5 | 77 |
| 2 | 1 | 3 | 1 | 1 | 21 |
| 3 | 2 | 1 | 1 | 2 | 139 |
| 4 | 2 | 2 | 1 | 2 | 68 |
cancer_df["logrisktime"] = np.log(cancer_df["risktime"])
formula = """count ~ C(histology) + C(stage) + C(time)"""
response, predictors = dmatrices(formula, cancer_df, return_type="dataframe")
fit = sm.GLM(
response, predictors, family=sm.families.Poisson(link=sm.families.links.log()),
offset=cancer_df["logrisktime"]
).fit()
print(fit.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: count No. Observations: 63
Model: GLM Df Residuals: 52
Model Family: Poisson Df Model: 10
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -114.87
Date: Tue, 12 Jan 2021 Deviance: 43.923
Time: 23:01:40 Pearson chi2: 43.1
No. Iterations: 5
Covariance Type: nonrobust
=====================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept -3.0093 0.167 -18.073 0.000 -3.336 -2.683
C(histology)[T.2] 0.1624 0.122 1.332 0.183 -0.077 0.401
C(histology)[T.3] 0.1075 0.147 0.729 0.466 -0.181 0.397
C(stage)[T.2] 0.4700 0.174 2.694 0.007 0.128 0.812
C(stage)[T.3] 1.3243 0.152 8.709 0.000 1.026 1.622
C(time)[T.2] -0.1275 0.149 -0.855 0.393 -0.420 0.165
C(time)[T.3] -0.0797 0.164 -0.488 0.626 -0.400 0.241
C(time)[T.4] 0.1189 0.171 0.695 0.487 -0.216 0.454
C(time)[T.5] -0.6651 0.261 -2.552 0.011 -1.176 -0.154
C(time)[T.6] -0.3502 0.243 -1.438 0.150 -0.827 0.127
C(time)[T.7] -0.1752 0.250 -0.701 0.483 -0.665 0.315
=====================================================================================
The increasing coefficients with stage reflect the higher mortality at later stages: stage 3 mortality is \(\exp(1.324) = 3.76\) times that of stage 1.
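Because the model uses a log link with `logrisktime` as an offset, exponentiating a coefficient gives a multiplicative rate ratio. A quick check of the stage effects, with the coefficients copied by hand from the summary above:

```python
import math

# Stage coefficients transcribed from the GLM summary (baseline: stage 1).
coef_stage2 = 0.4700
coef_stage3 = 1.3243

# exp(coef) is the mortality rate ratio relative to stage 1.
print(f"stage 2 vs 1: {math.exp(coef_stage2):.2f}")  # ≈ 1.60
print(f"stage 3 vs 1: {math.exp(coef_stage3):.2f}")  # ≈ 3.76
```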
drugs_df = pd.read_csv("https://github.com/saketkc/pyFLGLM/blob/master/data/Drugs.tsv.gz?raw=true", compression="gzip", sep="\t")
drugs_df = drugs_df.rename(columns={"A": "alc", "C": "cig", "M": "mar"})
drugs_df
|   | alc | cig | mar | count |
|---|-----|-----|-----|-------|
| 0 | yes | yes | yes | 911 |
| 1 | yes | yes | no | 538 |
| 2 | yes | no | yes | 44 |
| 3 | yes | no | no | 456 |
| 4 | no | yes | yes | 3 |
| 5 | no | yes | no | 43 |
| 6 | no | no | yes | 2 |
| 7 | no | no | no | 279 |
formula = """count ~ C(alc) + C(cig) + C(mar)"""
response, predictors = dmatrices(formula, drugs_df, return_type="dataframe")
mutual_indep = sm.GLM(
response, predictors, family=sm.families.Poisson(link=sm.families.links.log())).fit()
print(mutual_indep.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: count No. Observations: 8
Model: GLM Df Residuals: 4
Model Family: Poisson Df Model: 3
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -667.53
Date: Tue, 12 Jan 2021 Deviance: 1286.0
Time: 23:01:41 Pearson chi2: 1.41e+03
No. Iterations: 5
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 4.1725 0.065 64.234 0.000 4.045 4.300
C(alc)[T.yes] 1.7851 0.060 29.872 0.000 1.668 1.902
C(cig)[T.yes] 0.6493 0.044 14.707 0.000 0.563 0.736
C(mar)[T.yes] -0.3154 0.042 -7.431 0.000 -0.399 -0.232
=================================================================================
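The mutual-independence model fits very poorly: its deviance (1286.0 on 4 residual df, taken from the summary above) can be compared against a chi-squared reference distribution.

```python
from scipy import stats

# Deviance and residual df transcribed from the mutual-independence fit.
deviance, df_resid = 1286.0, 4

# Under the null that the model fits, the deviance is approximately
# chi-squared with df_resid degrees of freedom.
p_value = stats.chi2.sf(deviance, df_resid)
print(p_value)  # effectively 0: mutual independence is strongly rejected
```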
l = ["yes", "no"]
formula = """count ~ C(alc, levels=l) + C(cig, levels=l) + C(mar, levels=l) + C(alc, levels=l):C(cig, levels=l) + C(alc, levels=l):C(mar,levels=l) + C(cig,levels=l):C(mar,levels=l)"""
response, predictors = dmatrices(formula, drugs_df, return_type="dataframe")
homo_association = sm.GLM(
response, predictors, family=sm.families.Poisson(link=sm.families.links.log())).fit()
print(homo_association.summary())
print('AIC: {}'.format(homo_association.aic))
pearson_resid = homo_association.resid_pearson
std_resid = homo_association.resid_response  # note: raw (observed - fitted) residuals, not standardized
print(np.sum(pearson_resid**2))
counts = drugs_df["count"]
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: count No. Observations: 8
Model: GLM Df Residuals: 1
Model Family: Poisson Df Model: 6
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -24.709
Date: Tue, 12 Jan 2021 Deviance: 0.37399
Time: 23:01:41 Pearson chi2: 0.401
No. Iterations: 8
Covariance Type: nonrobust
=================================================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------------------------
Intercept 6.8139 0.033 205.699 0.000 6.749 6.879
C(alc, levels=l)[T.no] -5.5283 0.452 -12.225 0.000 -6.415 -4.642
C(cig, levels=l)[T.no] -3.0158 0.152 -19.891 0.000 -3.313 -2.719
C(mar, levels=l)[T.no] -0.5249 0.054 -9.669 0.000 -0.631 -0.418
C(alc, levels=l)[T.no]:C(cig, levels=l)[T.no] 2.0545 0.174 11.803 0.000 1.713 2.396
C(alc, levels=l)[T.no]:C(mar, levels=l)[T.no] 2.9860 0.465 6.426 0.000 2.075 3.897
C(cig, levels=l)[T.no]:C(mar, levels=l)[T.no] 2.8479 0.164 17.382 0.000 2.527 3.169
=================================================================================================================
AIC: 63.4174142317677
0.4011005166525919
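In a loglinear model for a 2×2×2 table with homogeneous association, exponentiating each two-way interaction coefficient gives the conditional odds ratio for that pair at either level of the third variable. Using the interaction coefficients transcribed from the summary above:

```python
import math

# Two-way interaction coefficients from the homogeneous-association fit.
interactions = {"alc:cig": 2.0545, "alc:mar": 2.9860, "cig:mar": 2.8479}

# exp(interaction coef) = conditional odds ratio for the pair,
# identical at each level of the remaining variable.
for pair, coef in interactions.items():
    print(f"{pair}: conditional OR = {math.exp(coef):.1f}")
```

All three conditional odds ratios are far above 1, consistent with strong pairwise associations among alcohol, cigarette, and marijuana use.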
df = pd.DataFrame( np.vstack([counts.values,
homo_association.fittedvalues,
homo_association.resid_pearson,
homo_association.resid_response])).T
df.columns = ["count", "fitted", "pearsonr_resid", "std_resid"]
df
|   | count | fitted | pearsonr_resid | std_resid |
|---|-------|--------|----------------|-----------|
| 0 | 911.0 | 910.38317 | 0.020443 | 0.61683 |
| 1 | 538.0 | 538.61683 | -0.026578 | -0.61683 |
| 2 | 44.0 | 44.61683 | -0.092346 | -0.61683 |
| 3 | 456.0 | 455.38317 | 0.028905 | 0.61683 |
| 4 | 3.0 | 3.61683 | -0.324341 | -0.61683 |
| 5 | 43.0 | 42.38317 | 0.094748 | 0.61683 |
| 6 | 2.0 | 1.38317 | 0.524479 | 0.61683 |
| 7 | 279.0 | 279.61683 | -0.036888 | -0.61683 |
drugs2_df = pd.read_csv("https://github.com/saketkc/pyFLGLM/blob/master/data/Drugs2.tsv.gz?raw=true", compression="gzip", sep="\t")
drugs2_df = drugs2_df.rename(columns={"A": "alc", "C": "cig"})
drugs2_df["M_yes_byn"] = drugs2_df["M_yes"]/drugs2_df["n"]
l = ["yes", "no"]
#formula = """M_yes/n ~ C(alc, levels=l) + C(cig, levels=l)"""
#formula = """I(M_yes/n) ~ C(alc) + C(cig)"""
formula = """M_yes_byn ~ C(alc) + C(cig)"""
response, predictors = dmatrices(formula, drugs2_df, return_type="dataframe")
fit = sm.GLM(response,
             predictors,
             family=sm.families.Binomial(link=sm.families.links.logit()),
             # sm.GLM takes var_weights/freq_weights; an unknown `weights`
             # kwarg is silently ignored, leaving the fit unweighted
             var_weights=drugs2_df["n"]).fit()
print(fit.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: M_yes_byn No. Observations: 4
Model: GLM Df Residuals: 1
Model Family: Binomial Df Model: 2
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -0.86889
Date: Tue, 12 Jan 2021 Deviance: 0.0017527
Time: 23:01:41 Pearson chi2: 0.00204
No. Iterations: 6
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept -5.4367 5.314 -1.023 0.306 -15.853 4.979
C(alc)[T.yes] 3.1326 4.304 0.728 0.467 -5.303 11.569
C(cig)[T.yes] 2.8189 3.920 0.719 0.472 -4.865 10.502
=================================================================================
5.1. Section 7.5.1
crabs_df = pd.read_csv("https://github.com/saketkc/pyFLGLM/blob/master/data/Crabs.tsv.gz?raw=true", compression="gzip", sep="\t")
crabs_df.head()
|   | crab | y | weight | width | color | spine |
|---|------|---|--------|-------|-------|-------|
| 1 | 1 | 8 | 3.05 | 28.3 | 2 | 3 |
| 2 | 2 | 0 | 1.55 | 22.5 | 3 | 3 |
| 3 | 3 | 9 | 2.30 | 26.0 | 1 | 1 |
| 4 | 4 | 0 | 2.10 | 24.8 | 3 | 3 |
| 5 | 5 | 4 | 2.60 | 26.0 | 3 | 3 |
formula = """y ~ 1"""
response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit = sm.GLM(response,
predictors,
family=sm.families.Poisson(link=sm.families.links.log())).fit()
print(fit.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 173
Model: GLM Df Residuals: 172
Model Family: Poisson Df Model: 0
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -494.04
Date: Tue, 12 Jan 2021 Deviance: 632.79
Time: 23:01:42 Pearson chi2: 584.
No. Iterations: 5
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.0713 0.044 24.074 0.000 0.984 1.158
==============================================================================
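For a Poisson model that fits, the Pearson chi-squared statistic should be close to its residual degrees of freedom. Using the figures printed in the null-model summary above:

```python
# Pearson chi2 and residual df transcribed from the null Poisson summary.
pearson_chi2, df_resid = 584.0, 172

# A ratio well above 1 indicates the counts are overdispersed
# relative to the Poisson assumption (variance = mean).
dispersion = pearson_chi2 / df_resid
print(round(dispersion, 2))  # ≈ 3.4
```

This dispersion of roughly 3.4 motivates the negative binomial fit next.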
formula = """y ~ 1"""
response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit = sm.GLM(response,
predictors,
family=sm.families.NegativeBinomial(link=sm.families.links.log())).fit(scale='x2')
# (crabs_df["y"].var()-crabs_df["y"].mean())/(crabs_df["y"].mean()**2)#- fit.mu
overdispersion = fit.pearson_chi2 / fit.df_resid
print(fit.summary())
print('Overdispersion: {}'.format(overdispersion))
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 173
Model: GLM Df Residuals: 172
Model Family: NegativeBinomial Df Model: 0
Link Function: log Scale: 0.86643
Method: IRLS Log-Likelihood: -444.42
Date: Tue, 12 Jan 2021 Deviance: 224.93
Time: 23:01:42 Pearson chi2: 149.
No. Iterations: 7
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.0713 0.082 13.064 0.000 0.911 1.232
==============================================================================
Overdispersion: 0.8664294490778556
import statsmodels.discrete.count_model as cm
formula = """y ~ 1"""
response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit = cm.ZeroInflatedPoisson(response,
predictors).fit()
print(fit.summary())
Optimization terminated successfully.
Current function value: 2.205865
Iterations: 6
Function evaluations: 8
Gradient evaluations: 8
ZeroInflatedPoisson Regression Results
===============================================================================
Dep. Variable: y No. Observations: 173
Model: ZeroInflatedPoisson Df Residuals: 172
Method: MLE Df Model: 0
Date: Tue, 12 Jan 2021 Pseudo R-squ.: 2.817e-11
Time: 23:01:42 Log-Likelihood: -381.61
converged: True LL-Null: -381.61
Covariance Type: nonrobust LLR p-value: nan
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
inflate_const -0.6139 0.162 -3.794 0.000 -0.931 -0.297
Intercept 1.5038 0.046 32.956 0.000 1.414 1.593
=================================================================================
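The zero-inflation part of the ZIP model is a logit model, so the inverse-logit of `inflate_const` estimates the probability that a crab belongs to the structural-zero class; the count component uses a log link. With the coefficients transcribed from the summary above:

```python
import math

# Coefficients from the intercept-only ZIP fit.
inflate_const = -0.6139  # zero-inflation (logit scale)
intercept = 1.5038       # count component (log scale)

# Inverse logit: estimated probability of a structural zero.
pi_zero = 1 / (1 + math.exp(-inflate_const))
print(round(pi_zero, 3))  # ≈ 0.351

# Mean satellite count for crabs in the Poisson component.
print(round(math.exp(intercept), 2))  # ≈ 4.50
```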
import statsmodels.discrete.count_model as cm
formula = """y ~ 1"""
response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit = cm.ZeroInflatedNegativeBinomialP(response,
predictors,
p=2).fit()
print(fit.summary())
Optimization terminated successfully.
Current function value: 2.134980
Iterations: 12
Function evaluations: 13
Gradient evaluations: 13
ZeroInflatedNegativeBinomialP Regression Results
=========================================================================================
Dep. Variable: y No. Observations: 173
Model: ZeroInflatedNegativeBinomialP Df Residuals: 172
Method: MLE Df Model: 0
Date: Tue, 12 Jan 2021 Pseudo R-squ.: 3.957e-11
Time: 23:01:42 Log-Likelihood: -369.35
converged: True LL-Null: -369.35
Covariance Type: nonrobust LLR p-value: nan
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
inflate_const -0.7279 0.183 -3.973 0.000 -1.087 -0.369
Intercept 1.4653 0.068 21.440 0.000 1.331 1.599
alpha 0.2242 0.078 2.864 0.004 0.071 0.378
=================================================================================
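The two zero-inflated fits can be compared by AIC, computed from the log-likelihoods printed above (the ZIP has two parameters, the ZINB adds a dispersion parameter `alpha`):

```python
# Log-likelihoods and parameter counts from the two intercept-only fits.
ll_zip, k_zip = -381.61, 2    # inflate_const + Intercept
ll_zinb, k_zinb = -369.35, 3  # + alpha

# AIC = 2k - 2*loglik; smaller is better.
aic_zip = 2 * k_zip - 2 * ll_zip
aic_zinb = 2 * k_zinb - 2 * ll_zinb
print(aic_zip, aic_zinb)  # 767.22 744.7
```

The ZINB wins despite the extra parameter, so there is overdispersion beyond what zero inflation alone explains.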
formula = """y ~ weight + color"""
response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit = cm.ZeroInflatedNegativeBinomialP(response,
predictors).fit()
print(fit.summary())
Optimization terminated successfully.
Current function value: 2.115471
Iterations: 19
Function evaluations: 22
Gradient evaluations: 22
ZeroInflatedNegativeBinomialP Regression Results
=========================================================================================
Dep. Variable: y No. Observations: 173
Model: ZeroInflatedNegativeBinomialP Df Residuals: 170
Method: MLE Df Model: 2
Date: Tue, 12 Jan 2021 Pseudo R-squ.: 0.009138
Time: 23:01:42 Log-Likelihood: -365.98
converged: True LL-Null: -369.35
Covariance Type: nonrobust LLR p-value: 0.03421
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
inflate_const -0.8296 0.219 -3.792 0.000 -1.258 -0.401
Intercept 0.7198 0.429 1.678 0.093 -0.121 1.560
weight 0.3104 0.132 2.356 0.018 0.052 0.569
color -0.0423 0.099 -0.426 0.670 -0.237 0.152
alpha 0.2434 0.092 2.641 0.008 0.063 0.424
=================================================================================
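In the count component of the ZINB model the link is log, so the `weight` coefficient translates to a multiplicative effect on the expected satellite count. Using the value transcribed from the summary above:

```python
import math

# weight coefficient from the ZINB count component (log link).
coef_weight = 0.3104

# Each 1 kg increase in weight multiplies the expected count by exp(coef).
print(round(math.exp(coef_weight), 2))  # ≈ 1.36
```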
formula = """y ~ weight + color"""
response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit = sm.GLM(response,
predictors,
family=sm.families.NegativeBinomial(link=sm.families.links.log())).fit(scale='x2')
print(fit.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 173
Model: GLM Df Residuals: 170
Model Family: NegativeBinomial Df Model: 2
Link Function: log Scale: 0.95361
Method: IRLS Log-Likelihood: -391.41
Date: Tue, 12 Jan 2021 Deviance: 201.33
Time: 23:01:42 Pearson chi2: 162.
No. Iterations: 15
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.3177 0.532 -0.597 0.550 -1.360 0.725
weight 0.7055 0.155 4.561 0.000 0.402 1.009
color -0.1734 0.115 -1.505 0.132 -0.399 0.052
==============================================================================