
5. Chapter 7 - Models for Count Data

!pip install proplot
import warnings

import pandas as pd
import proplot as plot
import seaborn as sns
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrices
from scipy import stats

warnings.filterwarnings("ignore")
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline


plt.rcParams["axes.labelweight"] = "bold"
plt.rcParams["font.weight"] = "bold"
cancer_df = pd.read_csv("https://github.com/saketkc/pyFLGLM/blob/master/data/Cancer.tsv.gz?raw=true", compression="gzip", sep="\t")
cancer_df.head()
time histology stage count risktime
0 1 1 1 9 157
1 1 2 1 5 77
2 1 3 1 1 21
3 2 1 1 2 139
4 2 2 1 2 68
cancer_df["logrisktime"] = np.log(cancer_df["risktime"])

formula = """count ~ C(histology) + C(stage) + C(time)"""
response, predictors = dmatrices(formula, cancer_df, return_type="dataframe")
fit = sm.GLM(
    response, predictors, family=sm.families.Poisson(link=sm.families.links.log()),
    offset=cancer_df["logrisktime"]
).fit()
print(fit.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  count   No. Observations:                   63
Model:                            GLM   Df Residuals:                       52
Model Family:                 Poisson   Df Model:                           10
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -114.87
Date:                Tue, 12 Jan 2021   Deviance:                       43.923
Time:                        23:01:40   Pearson chi2:                     43.1
No. Iterations:                     5                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            -3.0093      0.167    -18.073      0.000      -3.336      -2.683
C(histology)[T.2]     0.1624      0.122      1.332      0.183      -0.077       0.401
C(histology)[T.3]     0.1075      0.147      0.729      0.466      -0.181       0.397
C(stage)[T.2]         0.4700      0.174      2.694      0.007       0.128       0.812
C(stage)[T.3]         1.3243      0.152      8.709      0.000       1.026       1.622
C(time)[T.2]         -0.1275      0.149     -0.855      0.393      -0.420       0.165
C(time)[T.3]         -0.0797      0.164     -0.488      0.626      -0.400       0.241
C(time)[T.4]          0.1189      0.171      0.695      0.487      -0.216       0.454
C(time)[T.5]         -0.6651      0.261     -2.552      0.011      -1.176      -0.154
C(time)[T.6]         -0.3502      0.243     -1.438      0.150      -0.827       0.127
C(time)[T.7]         -0.1752      0.250     -0.701      0.483      -0.665       0.315
=====================================================================================

The coefficients increase with stage, reflecting the higher mortality at later stages. Stage 3 mortality is \(\exp(1.324) = 3.76\) times that of stage 1.
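The rate-ratio interpretation can be checked directly by exponentiating the fitted log-scale coefficient; a minimal sketch using the stage-3 estimate from the summary table above:

```python
import numpy as np

# Rate ratio for stage 3 vs. stage 1: exponentiate the log-scale coefficient
stage3_coef = 1.3243  # C(stage)[T.3] estimate from the summary above
rate_ratio = np.exp(stage3_coef)
print(round(rate_ratio, 2))  # ~3.76
```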

drugs_df = pd.read_csv("https://github.com/saketkc/pyFLGLM/blob/master/data/Drugs.tsv.gz?raw=true", compression="gzip", sep="\t")
drugs_df = drugs_df.rename(columns={"A": "alc", "C": "cig", "M": "mar"})
drugs_df
alc cig mar count
0 yes yes yes 911
1 yes yes no 538
2 yes no yes 44
3 yes no no 456
4 no yes yes 3
5 no yes no 43
6 no no yes 2
7 no no no 279
formula = """count ~ C(alc) + C(cig) + C(mar)"""
response, predictors = dmatrices(formula, drugs_df, return_type="dataframe")
mutual_indep = sm.GLM(
    response, predictors, family=sm.families.Poisson(link=sm.families.links.log())).fit()
print(mutual_indep.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  count   No. Observations:                    8
Model:                            GLM   Df Residuals:                        4
Model Family:                 Poisson   Df Model:                            3
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -667.53
Date:                Tue, 12 Jan 2021   Deviance:                       1286.0
Time:                        23:01:41   Pearson chi2:                 1.41e+03
No. Iterations:                     5                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         4.1725      0.065     64.234      0.000       4.045       4.300
C(alc)[T.yes]     1.7851      0.060     29.872      0.000       1.668       1.902
C(cig)[T.yes]     0.6493      0.044     14.707      0.000       0.563       0.736
C(mar)[T.yes]    -0.3154      0.042     -7.431      0.000      -0.399      -0.232
=================================================================================
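The deviance in the summary (1286.0 on 4 residual df) can be turned into a goodness-of-fit p-value against the chi-squared reference distribution; a sketch reusing the numbers printed above:

```python
from scipy import stats

deviance, df_resid = 1286.0, 4  # from the mutual-independence summary above
p_value = stats.chi2.sf(deviance, df_resid)
print(p_value)  # essentially zero: mutual independence fits very poorly
```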
l = ["yes", "no"]
formula = """count ~ C(alc, levels=l) + C(cig, levels=l) + C(mar, levels=l) + C(alc, levels=l):C(cig, levels=l) + C(alc, levels=l):C(mar,levels=l) + C(cig,levels=l):C(mar,levels=l)"""
response, predictors = dmatrices(formula, drugs_df, return_type="dataframe")
homo_association = sm.GLM(
    response, predictors, family=sm.families.Poisson(link=sm.families.links.log())).fit()

print(homo_association.summary())
print('AIC: {}'.format(homo_association.aic))
pearson_resid = homo_association.resid_pearson
std_resid = homo_association.resid_response
print(np.sum(pearson_resid**2))

counts = drugs_df["count"]
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  count   No. Observations:                    8
Model:                            GLM   Df Residuals:                        1
Model Family:                 Poisson   Df Model:                            6
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -24.709
Date:                Tue, 12 Jan 2021   Deviance:                      0.37399
Time:                        23:01:41   Pearson chi2:                    0.401
No. Iterations:                     8                                         
Covariance Type:            nonrobust                                         
=================================================================================================================
                                                    coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------------
Intercept                                         6.8139      0.033    205.699      0.000       6.749       6.879
C(alc, levels=l)[T.no]                           -5.5283      0.452    -12.225      0.000      -6.415      -4.642
C(cig, levels=l)[T.no]                           -3.0158      0.152    -19.891      0.000      -3.313      -2.719
C(mar, levels=l)[T.no]                           -0.5249      0.054     -9.669      0.000      -0.631      -0.418
C(alc, levels=l)[T.no]:C(cig, levels=l)[T.no]     2.0545      0.174     11.803      0.000       1.713       2.396
C(alc, levels=l)[T.no]:C(mar, levels=l)[T.no]     2.9860      0.465      6.426      0.000       2.075       3.897
C(cig, levels=l)[T.no]:C(mar, levels=l)[T.no]     2.8479      0.164     17.382      0.000       2.527       3.169
=================================================================================================================
AIC: 63.4174142317677
0.4011005166525919
df = pd.DataFrame(np.vstack([counts.values,
                             homo_association.fittedvalues,
                             homo_association.resid_pearson,
                             homo_association.resid_response])).T
df.columns = ["count", "fitted", "pearsonr_resid", "std_resid"]
df
count fitted pearsonr_resid std_resid
0 911.0 910.38317 0.020443 0.61683
1 538.0 538.61683 -0.026578 -0.61683
2 44.0 44.61683 -0.092346 -0.61683
3 456.0 455.38317 0.028905 0.61683
4 3.0 3.61683 -0.324341 -0.61683
5 43.0 42.38317 0.094748 0.61683
6 2.0 1.38317 0.524479 0.61683
7 279.0 279.61683 -0.036888 -0.61683
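The mutual-independence and homogeneous-association loglinear models are nested, so they can be compared with a likelihood-ratio test: the difference in deviances is chi-squared with df equal to the difference in residual df. A sketch using the deviances reported above:

```python
from scipy import stats

dev_indep, df_indep = 1286.0, 4    # mutual independence model
dev_homo, df_homo = 0.37399, 1     # homogeneous association model
lr_stat = dev_indep - dev_homo
p_value = stats.chi2.sf(lr_stat, df_indep - df_homo)
print(lr_stat, p_value)  # the two-factor interactions improve the fit dramatically
```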
drugs2_df = pd.read_csv("https://github.com/saketkc/pyFLGLM/blob/master/data/Drugs2.tsv.gz?raw=true", compression="gzip", sep="\t")
drugs2_df = drugs2_df.rename(columns={"A": "alc", "C": "cig"})
drugs2_df["M_yes_byn"] = drugs2_df["M_yes"]/drugs2_df["n"]
l = ["yes", "no"]

#formula = """M_yes/n ~ C(alc, levels=l) + C(cig, levels=l)"""
#formula = """I(M_yes/n) ~ C(alc) + C(cig)"""
formula = """M_yes_byn ~ C(alc) + C(cig)"""

response, predictors = dmatrices(formula, drugs2_df, return_type="dataframe")
fit = sm.GLM(response,
             predictors,
             family=sm.families.Binomial(link=sm.families.links.logit()),
             var_weights=drugs2_df["n"]).fit()  # var_weights (not weights) supplies the group sizes
print(fit.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:              M_yes_byn   No. Observations:                    4
Model:                            GLM   Df Residuals:                        1
Model Family:                Binomial   Df Model:                            2
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:               -0.86889
Date:                Tue, 12 Jan 2021   Deviance:                    0.0017527
Time:                        23:01:41   Pearson chi2:                  0.00204
No. Iterations:                     6                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        -5.4367      5.314     -1.023      0.306     -15.853       4.979
C(alc)[T.yes]     3.1326      4.304      0.728      0.467      -5.303      11.569
C(cig)[T.yes]     2.8189      3.920      0.719      0.472      -4.865      10.502
=================================================================================
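On the logit scale the coefficients are log odds ratios, so exponentiating them gives the conditional odds ratios for marijuana use; a sketch using the estimates above:

```python
import numpy as np

# Conditional odds ratios for marijuana use, from the logit coefficients above
or_alc = np.exp(3.1326)  # alcohol users vs. non-users
or_cig = np.exp(2.8189)  # cigarette users vs. non-users
print(round(or_alc, 1), round(or_cig, 1))
```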

5.1. Section 7.5.1

crabs_df = pd.read_csv("https://github.com/saketkc/pyFLGLM/blob/master/data/Crabs.tsv.gz?raw=true", compression="gzip", sep="\t")
crabs_df.head()
crab y weight width color spine
1 1 8 3.05 28.3 2 3
2 2 0 1.55 22.5 3 3
3 3 9 2.30 26.0 1 1
4 4 0 2.10 24.8 3 3
5 5 4 2.60 26.0 3 3
formula = """y ~ 1"""

response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit = sm.GLM(response, 
             predictors, 
             family=sm.families.Poisson(link=sm.families.links.log())).fit()
print(fit.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                  173
Model:                            GLM   Df Residuals:                      172
Model Family:                 Poisson   Df Model:                            0
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -494.04
Date:                Tue, 12 Jan 2021   Deviance:                       632.79
Time:                        23:01:42   Pearson chi2:                     584.
No. Iterations:                     5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.0713      0.044     24.074      0.000       0.984       1.158
==============================================================================
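A quick overdispersion check for the intercept-only Poisson fit divides the deviance (or Pearson chi-squared) by the residual degrees of freedom; values well above 1 signal overdispersion. A sketch with the numbers above:

```python
deviance, pearson_chi2, df_resid = 632.79, 584.0, 172  # from the summary above
print(deviance / df_resid, pearson_chi2 / df_resid)  # both well above 1
```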
formula = """y ~ 1"""

response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit = sm.GLM(response, 
             predictors, 
             family=sm.families.NegativeBinomial(link=sm.families.links.log())).fit(scale='x2')
overdispersion = fit.pearson_chi2 / fit.df_resid
print(fit.summary())

print('Overdispersion: {}'.format(overdispersion))
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                  173
Model:                            GLM   Df Residuals:                      172
Model Family:        NegativeBinomial   Df Model:                            0
Link Function:                    log   Scale:                         0.86643
Method:                          IRLS   Log-Likelihood:                -444.42
Date:                Tue, 12 Jan 2021   Deviance:                       224.93
Time:                        23:01:42   Pearson chi2:                     149.
No. Iterations:                     7                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.0713      0.082     13.064      0.000       0.911       1.232
==============================================================================
Overdispersion: 0.8664294490778556
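The negative binomial accommodates the extra variance through its dispersion parameter. A small simulation sketch (hypothetical data, not the crab counts) illustrating why the Poisson assumption variance = mean fails under overdispersion:

```python
import numpy as np

rng = np.random.default_rng(0)
# Poisson: variance equals the mean
pois = rng.poisson(lam=3.0, size=100_000)
# Negative binomial with the same mean (n*(1-p)/p = 3) but variance n*(1-p)/p^2 = 7.5
nb = rng.negative_binomial(n=2, p=0.4, size=100_000)
print(pois.mean(), pois.var())  # roughly equal
print(nb.mean(), nb.var())      # variance well above the mean
```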
import statsmodels.discrete.count_model as cm
formula = """y ~ 1"""

response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")

fit = cm.ZeroInflatedPoisson(response, 
                             predictors).fit()
print(fit.summary())
Optimization terminated successfully.
         Current function value: 2.205865
         Iterations: 6
         Function evaluations: 8
         Gradient evaluations: 8
                     ZeroInflatedPoisson Regression Results                    
===============================================================================
Dep. Variable:                       y   No. Observations:                  173
Model:             ZeroInflatedPoisson   Df Residuals:                      172
Method:                            MLE   Df Model:                            0
Date:                 Tue, 12 Jan 2021   Pseudo R-squ.:               2.817e-11
Time:                         23:01:42   Log-Likelihood:                -381.61
converged:                        True   LL-Null:                       -381.61
Covariance Type:             nonrobust   LLR p-value:                       nan
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
inflate_const    -0.6139      0.162     -3.794      0.000      -0.931      -0.297
Intercept         1.5038      0.046     32.956      0.000       1.414       1.593
=================================================================================
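In the ZIP model the inflation intercept is on the logit scale, so the estimated probability of a structural zero is its inverse logit; a sketch using the estimate above:

```python
from scipy.special import expit

inflate_const = -0.6139  # from the ZIP summary above
pi_zero = expit(inflate_const)  # estimated P(structural zero)
print(round(pi_zero, 3))  # ~0.351
```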
import statsmodels.discrete.count_model as cm
formula = """y ~ 1"""

response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")

fit = cm.ZeroInflatedNegativeBinomialP(response,
                                       predictors,
                                       p=2).fit()
print(fit.summary())
Optimization terminated successfully.
         Current function value: 2.134980
         Iterations: 12
         Function evaluations: 13
         Gradient evaluations: 13
                     ZeroInflatedNegativeBinomialP Regression Results                    
=========================================================================================
Dep. Variable:                                 y   No. Observations:                  173
Model:             ZeroInflatedNegativeBinomialP   Df Residuals:                      172
Method:                                      MLE   Df Model:                            0
Date:                           Tue, 12 Jan 2021   Pseudo R-squ.:               3.957e-11
Time:                                   23:01:42   Log-Likelihood:                -369.35
converged:                                  True   LL-Null:                       -369.35
Covariance Type:                       nonrobust   LLR p-value:                       nan
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
inflate_const    -0.7279      0.183     -3.973      0.000      -1.087      -0.369
Intercept         1.4653      0.068     21.440      0.000       1.331       1.599
alpha             0.2242      0.078      2.864      0.004       0.071       0.378
=================================================================================
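The ZIP and ZINB fits can be compared with AIC = 2k - 2*loglik, using the log-likelihoods and parameter counts reported above:

```python
# AIC = 2k - 2*loglik; k counts the estimated parameters
aic_zip = 2 * 2 - 2 * (-381.61)   # inflate_const + Intercept
aic_zinb = 2 * 3 - 2 * (-369.35)  # inflate_const + Intercept + alpha
print(aic_zip, aic_zinb)  # the ZINB has the lower AIC
```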
formula = """y ~ weight + color"""

response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")

fit = cm.ZeroInflatedNegativeBinomialP(response, 
                                       predictors).fit()
print(fit.summary())
Optimization terminated successfully.
         Current function value: 2.115471
         Iterations: 19
         Function evaluations: 22
         Gradient evaluations: 22
                     ZeroInflatedNegativeBinomialP Regression Results                    
=========================================================================================
Dep. Variable:                                 y   No. Observations:                  173
Model:             ZeroInflatedNegativeBinomialP   Df Residuals:                      170
Method:                                      MLE   Df Model:                            2
Date:                           Tue, 12 Jan 2021   Pseudo R-squ.:                0.009138
Time:                                   23:01:42   Log-Likelihood:                -365.98
converged:                                  True   LL-Null:                       -369.35
Covariance Type:                       nonrobust   LLR p-value:                   0.03421
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
inflate_const    -0.8296      0.219     -3.792      0.000      -1.258      -0.401
Intercept         0.7198      0.429      1.678      0.093      -0.121       1.560
weight            0.3104      0.132      2.356      0.018       0.052       0.569
color            -0.0423      0.099     -0.426      0.670      -0.237       0.152
alpha             0.2434      0.092      2.641      0.008       0.063       0.424
=================================================================================
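The LLR p-value in the summary comes from a likelihood-ratio test of the fitted model against the intercept-only ZINB; it can be reproduced (up to rounding of the printed log-likelihoods) from the values above:

```python
from scipy import stats

ll_full, ll_null = -365.98, -369.35  # from the summary above
lr_stat = 2 * (ll_full - ll_null)
p_value = stats.chi2.sf(lr_stat, df=2)  # 2 extra parameters: weight, color
print(round(p_value, 3))  # close to the reported LLR p-value of 0.03421
```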
formula = """y ~ weight + color"""

response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")

fit = sm.GLM(response, 
             predictors, 
             family=sm.families.NegativeBinomial(link=sm.families.links.log())).fit(scale='x2')
print(fit.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                  173
Model:                            GLM   Df Residuals:                      170
Model Family:        NegativeBinomial   Df Model:                            2
Link Function:                    log   Scale:                         0.95361
Method:                          IRLS   Log-Likelihood:                -391.41
Date:                Tue, 12 Jan 2021   Deviance:                       201.33
Time:                        23:01:42   Pearson chi2:                     162.
No. Iterations:                    15                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.3177      0.532     -0.597      0.550      -1.360       0.725
weight         0.7055      0.155      4.561      0.000       0.402       1.009
color         -0.1734      0.115     -1.505      0.132      -0.399       0.052
==============================================================================
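As with the Poisson fits, the negative binomial coefficients act multiplicatively on the expected count under the log link; a sketch exponentiating the weight estimate above:

```python
import numpy as np

weight_coef = 0.7055  # from the NB summary above
print(round(np.exp(weight_coef), 2))  # each extra kg roughly doubles the expected count
```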