5. Chapter 7 - Models for Count Data
!pip install proplot
import warnings
import pandas as pd
import proplot as plot
import seaborn as sns
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrices
from scipy import stats
warnings.filterwarnings("ignore")
%pylab inline
plt.rcParams["axes.labelweight"] = "bold"
plt.rcParams["font.weight"] = "bold"
cancer_df = pd.read_csv("https://github.com/saketkc/pyFLGLM/blob/master/data/Cancer.tsv.gz?raw=true", compression="gzip", sep="\t")
cancer_df.head()
|   | time | histology | stage | count | risktime |
|---|------|-----------|-------|-------|----------|
| 0 | 1 | 1 | 1 | 9 | 157 |
| 1 | 1 | 2 | 1 | 5 | 77 |
| 2 | 1 | 3 | 1 | 1 | 21 |
| 3 | 2 | 1 | 1 | 2 | 139 |
| 4 | 2 | 2 | 1 | 2 | 68 |
cancer_df["logrisktime"] = np.log(cancer_df["risktime"])
formula = """count ~ C(histology) + C(stage) + C(time)"""
response, predictors = dmatrices(formula, cancer_df, return_type="dataframe")
fit = sm.GLM(
response, predictors, family=sm.families.Poisson(link=sm.families.links.log()),
offset=cancer_df["logrisktime"]
).fit()
print(fit.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: count No. Observations: 63
Model: GLM Df Residuals: 52
Model Family: Poisson Df Model: 10
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -114.87
Date: Tue, 12 Jan 2021 Deviance: 43.923
Time: 23:01:40 Pearson chi2: 43.1
No. Iterations: 5
Covariance Type: nonrobust
=====================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept -3.0093 0.167 -18.073 0.000 -3.336 -2.683
C(histology)[T.2] 0.1624 0.122 1.332 0.183 -0.077 0.401
C(histology)[T.3] 0.1075 0.147 0.729 0.466 -0.181 0.397
C(stage)[T.2] 0.4700 0.174 2.694 0.007 0.128 0.812
C(stage)[T.3] 1.3243 0.152 8.709 0.000 1.026 1.622
C(time)[T.2] -0.1275 0.149 -0.855 0.393 -0.420 0.165
C(time)[T.3] -0.0797 0.164 -0.488 0.626 -0.400 0.241
C(time)[T.4] 0.1189 0.171 0.695 0.487 -0.216 0.454
C(time)[T.5] -0.6651 0.261 -2.552 0.011 -1.176 -0.154
C(time)[T.6] -0.3502 0.243 -1.438 0.150 -0.827 0.127
C(time)[T.7] -0.1752 0.250 -0.701 0.483 -0.665 0.315
=====================================================================================
The increasing coefficients with stage reflect the higher mortality at later stages: stage 3 mortality is \(\exp(1.324) = 3.76\) times that of stage 1.
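Because the model uses a log link with `logrisktime` as an offset, exponentiating a coefficient gives a multiplicative rate ratio. A quick check of the stage effects, with the coefficients copied by hand from the summary above:

```python
import math

# Stage coefficients transcribed from the GLM summary (baseline: stage 1).
coef_stage2 = 0.4700
coef_stage3 = 1.3243

# exp(coef) is the mortality rate ratio relative to stage 1.
print(f"stage 2 vs 1: {math.exp(coef_stage2):.2f}")  # ≈ 1.60
print(f"stage 3 vs 1: {math.exp(coef_stage3):.2f}")  # ≈ 3.76
```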
drugs_df = pd.read_csv("https://github.com/saketkc/pyFLGLM/blob/master/data/Drugs.tsv.gz?raw=true", compression="gzip", sep="\t")
drugs_df = drugs_df.rename(columns={"A": "alc", "C": "cig", "M": "mar"})
drugs_df
|   | alc | cig | mar | count |
|---|-----|-----|-----|-------|
| 0 | yes | yes | yes | 911 |
| 1 | yes | yes | no | 538 |
| 2 | yes | no | yes | 44 |
| 3 | yes | no | no | 456 |
| 4 | no | yes | yes | 3 |
| 5 | no | yes | no | 43 |
| 6 | no | no | yes | 2 |
| 7 | no | no | no | 279 |
formula = """count ~ C(alc) + C(cig) + C(mar)"""
response, predictors = dmatrices(formula, drugs_df, return_type="dataframe")
mutual_indep = sm.GLM(
response, predictors, family=sm.families.Poisson(link=sm.families.links.log())).fit()
print(mutual_indep.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: count No. Observations: 8
Model: GLM Df Residuals: 4
Model Family: Poisson Df Model: 3
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -667.53
Date: Tue, 12 Jan 2021 Deviance: 1286.0
Time: 23:01:41 Pearson chi2: 1.41e+03
No. Iterations: 5
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 4.1725 0.065 64.234 0.000 4.045 4.300
C(alc)[T.yes] 1.7851 0.060 29.872 0.000 1.668 1.902
C(cig)[T.yes] 0.6493 0.044 14.707 0.000 0.563 0.736
C(mar)[T.yes] -0.3154 0.042 -7.431 0.000 -0.399 -0.232
=================================================================================
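The mutual-independence model fits very poorly: its deviance (1286.0 on 4 residual df, taken from the summary above) can be compared against a chi-squared reference distribution.

```python
from scipy import stats

# Deviance and residual df transcribed from the mutual-independence fit.
deviance, df_resid = 1286.0, 4

# Under the null that the model fits, the deviance is approximately
# chi-squared with df_resid degrees of freedom.
p_value = stats.chi2.sf(deviance, df_resid)
print(p_value)  # effectively 0: mutual independence is strongly rejected
```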
l = ["yes", "no"]
formula = """count ~ C(alc, levels=l) + C(cig, levels=l) + C(mar, levels=l) + C(alc, levels=l):C(cig, levels=l) + C(alc, levels=l):C(mar,levels=l) + C(cig,levels=l):C(mar,levels=l)"""
response, predictors = dmatrices(formula, drugs_df, return_type="dataframe")
homo_association = sm.GLM(
response, predictors, family=sm.families.Poisson(link=sm.families.links.log())).fit()
print(homo_association.summary())
print('AIC: {}'.format(homo_association.aic))
pearson_resid = homo_association.resid_pearson
std_resid = homo_association.resid_response  # note: raw (observed - fitted) residuals, not standardized
print(np.sum(pearson_resid**2))
counts = drugs_df["count"]
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: count No. Observations: 8
Model: GLM Df Residuals: 1
Model Family: Poisson Df Model: 6
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -24.709
Date: Tue, 12 Jan 2021 Deviance: 0.37399
Time: 23:01:41 Pearson chi2: 0.401
No. Iterations: 8
Covariance Type: nonrobust
=================================================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------------------------
Intercept 6.8139 0.033 205.699 0.000 6.749 6.879
C(alc, levels=l)[T.no] -5.5283 0.452 -12.225 0.000 -6.415 -4.642
C(cig, levels=l)[T.no] -3.0158 0.152 -19.891 0.000 -3.313 -2.719
C(mar, levels=l)[T.no] -0.5249 0.054 -9.669 0.000 -0.631 -0.418
C(alc, levels=l)[T.no]:C(cig, levels=l)[T.no] 2.0545 0.174 11.803 0.000 1.713 2.396
C(alc, levels=l)[T.no]:C(mar, levels=l)[T.no] 2.9860 0.465 6.426 0.000 2.075 3.897
C(cig, levels=l)[T.no]:C(mar, levels=l)[T.no] 2.8479 0.164 17.382 0.000 2.527 3.169
=================================================================================================================
AIC: 63.4174142317677
0.4011005166525919
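In a loglinear model for a 2×2×2 table with homogeneous association, exponentiating each two-way interaction coefficient gives the conditional odds ratio for that pair at either level of the third variable. Using the interaction coefficients transcribed from the summary above:

```python
import math

# Two-way interaction coefficients from the homogeneous-association fit.
interactions = {"alc:cig": 2.0545, "alc:mar": 2.9860, "cig:mar": 2.8479}

# exp(interaction coef) = conditional odds ratio for the pair,
# identical at each level of the remaining variable.
for pair, coef in interactions.items():
    print(f"{pair}: conditional OR = {math.exp(coef):.1f}")
```

All three conditional odds ratios are far above 1, consistent with strong pairwise associations among alcohol, cigarette, and marijuana use.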
df = pd.DataFrame( np.vstack([counts.values,
homo_association.fittedvalues,
homo_association.resid_pearson,
homo_association.resid_response])).T
df.columns = ["count", "fitted", "pearsonr_resid", "std_resid"]
df
|   | count | fitted | pearsonr_resid | std_resid |
|---|-------|--------|----------------|-----------|
| 0 | 911.0 | 910.38317 | 0.020443 | 0.61683 |
| 1 | 538.0 | 538.61683 | -0.026578 | -0.61683 |
| 2 | 44.0 | 44.61683 | -0.092346 | -0.61683 |
| 3 | 456.0 | 455.38317 | 0.028905 | 0.61683 |
| 4 | 3.0 | 3.61683 | -0.324341 | -0.61683 |
| 5 | 43.0 | 42.38317 | 0.094748 | 0.61683 |
| 6 | 2.0 | 1.38317 | 0.524479 | 0.61683 |
| 7 | 279.0 | 279.61683 | -0.036888 | -0.61683 |
drugs2_df = pd.read_csv("https://github.com/saketkc/pyFLGLM/blob/master/data/Drugs2.tsv.gz?raw=true", compression="gzip", sep="\t")
drugs2_df = drugs2_df.rename(columns={"A": "alc", "C": "cig"})
drugs2_df["M_yes_byn"] = drugs2_df["M_yes"]/drugs2_df["n"]
l = ["yes", "no"]
#formula = """M_yes/n ~ C(alc, levels=l) + C(cig, levels=l)"""
#formula = """I(M_yes/n) ~ C(alc) + C(cig)"""
formula = """M_yes_byn ~ C(alc) + C(cig)"""
response, predictors = dmatrices(formula, drugs2_df, return_type="dataframe")
fit = sm.GLM(response,
             predictors,
             family=sm.families.Binomial(link=sm.families.links.logit()),
             # sm.GLM takes var_weights/freq_weights; an unknown `weights`
             # kwarg is silently ignored, leaving the fit unweighted
             var_weights=drugs2_df["n"]).fit()
print(fit.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: M_yes_byn No. Observations: 4
Model: GLM Df Residuals: 1
Model Family: Binomial Df Model: 2
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -0.86889
Date: Tue, 12 Jan 2021 Deviance: 0.0017527
Time: 23:01:41 Pearson chi2: 0.00204
No. Iterations: 6
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept -5.4367 5.314 -1.023 0.306 -15.853 4.979
C(alc)[T.yes] 3.1326 4.304 0.728 0.467 -5.303 11.569
C(cig)[T.yes] 2.8189 3.920 0.719 0.472 -4.865 10.502
=================================================================================
5.1. Section 7.5.1
crabs_df = pd.read_csv("https://github.com/saketkc/pyFLGLM/blob/master/data/Crabs.tsv.gz?raw=true", compression="gzip", sep="\t")
crabs_df.head()
|   | crab | y | weight | width | color | spine |
|---|------|---|--------|-------|-------|-------|
| 1 | 1 | 8 | 3.05 | 28.3 | 2 | 3 |
| 2 | 2 | 0 | 1.55 | 22.5 | 3 | 3 |
| 3 | 3 | 9 | 2.30 | 26.0 | 1 | 1 |
| 4 | 4 | 0 | 2.10 | 24.8 | 3 | 3 |
| 5 | 5 | 4 | 2.60 | 26.0 | 3 | 3 |
formula = """y ~ 1"""
response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit = sm.GLM(response,
predictors,
family=sm.families.Poisson(link=sm.families.links.log())).fit()
print(fit.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 173
Model: GLM Df Residuals: 172
Model Family: Poisson Df Model: 0
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -494.04
Date: Tue, 12 Jan 2021 Deviance: 632.79
Time: 23:01:42 Pearson chi2: 584.
No. Iterations: 5
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.0713 0.044 24.074 0.000 0.984 1.158
==============================================================================
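For a Poisson model that fits, the Pearson chi-squared statistic should be close to its residual degrees of freedom. Using the figures printed in the null-model summary above:

```python
# Pearson chi2 and residual df transcribed from the null Poisson summary.
pearson_chi2, df_resid = 584.0, 172

# A ratio well above 1 indicates the counts are overdispersed
# relative to the Poisson assumption (variance = mean).
dispersion = pearson_chi2 / df_resid
print(round(dispersion, 2))  # ≈ 3.4
```

This dispersion of roughly 3.4 motivates the negative binomial fit next.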
formula = """y ~ 1"""
response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit = sm.GLM(response,
predictors,
family=sm.families.NegativeBinomial(link=sm.families.links.log())).fit(scale='x2')
# (crabs_df["y"].var()-crabs_df["y"].mean())/(crabs_df["y"].mean()**2)#- fit.mu
overdispersion = fit.pearson_chi2 / fit.df_resid
print(fit.summary())
print('Overdispersion: {}'.format(overdispersion))
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 173
Model: GLM Df Residuals: 172
Model Family: NegativeBinomial Df Model: 0
Link Function: log Scale: 0.86643
Method: IRLS Log-Likelihood: -444.42
Date: Tue, 12 Jan 2021 Deviance: 224.93
Time: 23:01:42 Pearson chi2: 149.
No. Iterations: 7
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.0713 0.082 13.064 0.000 0.911 1.232
==============================================================================
Overdispersion: 0.8664294490778556
import statsmodels.discrete.count_model as cm
formula = """y ~ 1"""
response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit = cm.ZeroInflatedPoisson(response,
predictors).fit()
print(fit.summary())
Optimization terminated successfully.
Current function value: 2.205865
Iterations: 6
Function evaluations: 8
Gradient evaluations: 8
ZeroInflatedPoisson Regression Results
===============================================================================
Dep. Variable: y No. Observations: 173
Model: ZeroInflatedPoisson Df Residuals: 172
Method: MLE Df Model: 0
Date: Tue, 12 Jan 2021 Pseudo R-squ.: 2.817e-11
Time: 23:01:42 Log-Likelihood: -381.61
converged: True LL-Null: -381.61
Covariance Type: nonrobust LLR p-value: nan
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
inflate_const -0.6139 0.162 -3.794 0.000 -0.931 -0.297
Intercept 1.5038 0.046 32.956 0.000 1.414 1.593
=================================================================================
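The zero-inflation part of the ZIP model is a logit model, so the inverse-logit of `inflate_const` estimates the probability that a crab belongs to the structural-zero class; the count component uses a log link. With the coefficients transcribed from the summary above:

```python
import math

# Coefficients from the intercept-only ZIP fit.
inflate_const = -0.6139  # zero-inflation (logit scale)
intercept = 1.5038       # count component (log scale)

# Inverse logit: estimated probability of a structural zero.
pi_zero = 1 / (1 + math.exp(-inflate_const))
print(round(pi_zero, 3))  # ≈ 0.351

# Mean satellite count for crabs in the Poisson component.
print(round(math.exp(intercept), 2))  # ≈ 4.50
```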
import statsmodels.discrete.count_model as cm
formula = """y ~ 1"""
response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit = cm.ZeroInflatedNegativeBinomialP(response,
predictors,
p=2).fit()
print(fit.summary())
Optimization terminated successfully.
Current function value: 2.134980
Iterations: 12
Function evaluations: 13
Gradient evaluations: 13
ZeroInflatedNegativeBinomialP Regression Results
=========================================================================================
Dep. Variable: y No. Observations: 173
Model: ZeroInflatedNegativeBinomialP Df Residuals: 172
Method: MLE Df Model: 0
Date: Tue, 12 Jan 2021 Pseudo R-squ.: 3.957e-11
Time: 23:01:42 Log-Likelihood: -369.35
converged: True LL-Null: -369.35
Covariance Type: nonrobust LLR p-value: nan
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
inflate_const -0.7279 0.183 -3.973 0.000 -1.087 -0.369
Intercept 1.4653 0.068 21.440 0.000 1.331 1.599
alpha 0.2242 0.078 2.864 0.004 0.071 0.378
=================================================================================
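The two zero-inflated fits can be compared by AIC, computed from the log-likelihoods printed above (the ZIP has two parameters, the ZINB adds a dispersion parameter `alpha`):

```python
# Log-likelihoods and parameter counts from the two intercept-only fits.
ll_zip, k_zip = -381.61, 2    # inflate_const + Intercept
ll_zinb, k_zinb = -369.35, 3  # + alpha

# AIC = 2k - 2*loglik; smaller is better.
aic_zip = 2 * k_zip - 2 * ll_zip
aic_zinb = 2 * k_zinb - 2 * ll_zinb
print(aic_zip, aic_zinb)  # 767.22 744.7
```

The ZINB wins despite the extra parameter, so there is overdispersion beyond what zero inflation alone explains.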
formula = """y ~ weight + color"""
response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit = cm.ZeroInflatedNegativeBinomialP(response,
predictors).fit()
print(fit.summary())
Optimization terminated successfully.
Current function value: 2.115471
Iterations: 19
Function evaluations: 22
Gradient evaluations: 22
ZeroInflatedNegativeBinomialP Regression Results
=========================================================================================
Dep. Variable: y No. Observations: 173
Model: ZeroInflatedNegativeBinomialP Df Residuals: 170
Method: MLE Df Model: 2
Date: Tue, 12 Jan 2021 Pseudo R-squ.: 0.009138
Time: 23:01:42 Log-Likelihood: -365.98
converged: True LL-Null: -369.35
Covariance Type: nonrobust LLR p-value: 0.03421
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
inflate_const -0.8296 0.219 -3.792 0.000 -1.258 -0.401
Intercept 0.7198 0.429 1.678 0.093 -0.121 1.560
weight 0.3104 0.132 2.356 0.018 0.052 0.569
color -0.0423 0.099 -0.426 0.670 -0.237 0.152
alpha 0.2434 0.092 2.641 0.008 0.063 0.424
=================================================================================
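In the count component of the ZINB model the link is log, so the `weight` coefficient translates to a multiplicative effect on the expected satellite count. Using the value transcribed from the summary above:

```python
import math

# weight coefficient from the ZINB count component (log link).
coef_weight = 0.3104

# Each 1 kg increase in weight multiplies the expected count by exp(coef).
print(round(math.exp(coef_weight), 2))  # ≈ 1.36
```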
formula = """y ~ weight + color"""
response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit = sm.GLM(response,
predictors,
family=sm.families.NegativeBinomial(link=sm.families.links.log())).fit(scale='x2')
print(fit.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 173
Model: GLM Df Residuals: 170
Model Family: NegativeBinomial Df Model: 2
Link Function: log Scale: 0.95361
Method: IRLS Log-Likelihood: -391.41
Date: Tue, 12 Jan 2021 Deviance: 201.33
Time: 23:01:42 Pearson chi2: 162.
No. Iterations: 15
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.3177 0.532 -0.597 0.550 -1.360 0.725
weight 0.7055 0.155 4.561 0.000 0.402 1.009
color -0.1734 0.115 -1.505 0.132 -0.399 0.052
==============================================================================