1. Chapter 1 - Introduction to Linear and Generalized Linear Models
import warnings
import pandas as pd
import proplot as plot
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrices
from scipy import stats
warnings.filterwarnings("ignore")
%pylab inline
plt.rcParams["axes.labelweight"] = "bold"
plt.rcParams["font.weight"] = "bold"
Populating the interactive namespace from numpy and matplotlib
crabs_df = pd.read_csv("../data/Crabs.tsv.gz", sep="\t")
crabs_df.head()
| | crab | y | weight | width | color | spine |
|---|---|---|---|---|---|---|
| 1 | 1 | 8 | 3.05 | 28.3 | 2 | 3 |
| 2 | 2 | 0 | 1.55 | 22.5 | 3 | 3 |
| 3 | 3 | 9 | 2.30 | 26.0 | 1 | 1 |
| 4 | 4 | 0 | 2.10 | 24.8 | 3 | 3 |
| 5 | 5 | 4 | 2.60 | 26.0 | 3 | 3 |
This data comes from a study of female horseshoe crabs (citation unknown). During the spawning season, the females migrate to the shore to breed. A male attaches himself to the female's posterior spine while she burrows into the sand and lays clusters of eggs. Fertilization of the eggs happens externally in the sand beneath the pair. During spawning, other males may cluster around the pair and may also fertilize the eggs. These males are called satellites.
crab: observation index
y: number of satellites attached
weight: weight of the female crab
width: carapace width of the female crab
color: color of the female
spine: condition of the female's spine
print((crabs_df["y"].mean(), crabs_df["y"].var()))
(2.9190751445086707, 9.912017744320472)
sns.distplot(crabs_df["y"], kde=False, color="slateblue")
<AxesSubplot:xlabel='y'>
Figure: histogram of the satellite counts y.
pd.crosstab(index=crabs_df["y"], columns="count")
| y | count |
|---|---|
| 0 | 62 |
| 1 | 16 |
| 2 | 9 |
| 3 | 19 |
| 4 | 19 |
| 5 | 15 |
| 6 | 13 |
| 7 | 4 |
| 8 | 6 |
| 9 | 3 |
| 10 | 3 |
| 11 | 1 |
| 12 | 1 |
| 14 | 1 |
| 15 | 1 |
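As a quick check of the Poisson assumption, we can compare these observed counts with the frequencies that a Poisson distribution with the same mean would predict. A minimal sketch using the scipy import from the preamble (the variable names here are ours):

# Expected counts under a Poisson distribution with the sample mean as its rate
lam = crabs_df["y"].mean()
counts = crabs_df["y"].value_counts().sort_index()
expected = len(crabs_df) * stats.poisson.pmf(counts.index, lam)
print(pd.DataFrame({"observed": counts, "expected": expected.round(1)}))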
formula = """y ~ 1"""
response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit_pois = sm.GLM(
response, predictors, family=sm.families.Poisson(link=sm.families.links.identity())
).fit()
print(fit_pois.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 173
Model: GLM Df Residuals: 172
Model Family: Poisson Df Model: 0
Link Function: identity Scale: 1.0000
Method: IRLS Log-Likelihood: -494.04
Date: Tue, 12 Jan 2021 Deviance: 632.79
Time: 23:01:17 Pearson chi2: 584.
No. Iterations: 3
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 2.9191 0.130 22.472 0.000 2.664 3.174
==============================================================================
Fitting a Poisson GLM containing only an intercept and using the identity link function gives an intercept estimate that is simply the mean of y. But a Poisson distribution has variance equal to its mean. The sample variance of 9.91 suggests that a Poisson fit is not appropriate here.
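We can confirm both claims numerically; the variance-to-mean ratio below is our own quick diagnostic, not part of the model output:

# The intercept-only Poisson estimate equals the sample mean of y
print(fit_pois.params["Intercept"], crabs_df["y"].mean())
# Variance-to-mean ratio; values well above 1 indicate overdispersion
print(crabs_df["y"].var() / crabs_df["y"].mean())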
1.1. Linear Model Using Weight to Predict Satellite Counts
print((crabs_df["weight"].mean(), crabs_df["weight"].var()))
(2.437190751445087, 0.33295809712326924)
print(crabs_df["weight"].quantile(q=[0, 0.25, 0.5, 0.75, 1]))
0.00 1.20
0.25 2.00
0.50 2.35
0.75 2.85
1.00 5.20
Name: weight, dtype: float64
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(crabs_df["weight"], crabs_df["y"])
ax.set_xlabel("weight")
ax.set_ylabel("y")
Text(0, 0.5, 'y')
Figure: scatter plot of y against weight.
The scatter plot shows no clear trend in the relationship between y (the number of satellites) and weight.
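To attach a number to this impression, one can compute the Pearson correlation between weight and y; a sketch using the scipy import from the preamble:

# Pearson correlation between weight and satellite count
r, p_value = stats.pearsonr(crabs_df["weight"], crabs_df["y"])
print(f"r = {r:.3f}, p = {p_value:.2g}")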
1.2. Fit a LM vs GLM (Gaussian)
formula = """y ~ weight"""
fit_weight = smf.ols(formula=formula, data=crabs_df).fit()
print(fit_weight.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.136
Model: OLS Adj. R-squared: 0.131
Method: Least Squares F-statistic: 27.00
Date: Tue, 12 Jan 2021 Prob (F-statistic): 5.75e-07
Time: 23:01:17 Log-Likelihood: -430.70
No. Observations: 173 AIC: 865.4
Df Residuals: 171 BIC: 871.7
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -1.9911 0.971 -2.050 0.042 -3.908 -0.074
weight 2.0147 0.388 5.196 0.000 1.249 2.780
==============================================================================
Omnibus: 38.273 Durbin-Watson: 1.750
Prob(Omnibus): 0.000 Jarque-Bera (JB): 58.768
Skew: 1.188 Prob(JB): 1.73e-13
Kurtosis: 4.584 Cond. No. 12.6
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit_weight2 = sm.GLM(response, predictors, family=sm.families.Gaussian()).fit()
print(fit_weight2.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 173
Model: GLM Df Residuals: 171
Model Family: Gaussian Df Model: 1
Link Function: identity Scale: 8.6106
Method: IRLS Log-Likelihood: -430.70
Date: Tue, 12 Jan 2021 Deviance: 1472.4
Time: 23:01:17 Pearson chi2: 1.47e+03
No. Iterations: 3
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -1.9911 0.971 -2.050 0.040 -3.894 -0.088
weight 2.0147 0.388 5.196 0.000 1.255 2.775
==============================================================================
Thus OLS and a GLM with a Gaussian family and identity link are one and the same.
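A quick sanity check that the two fits agree on the point estimates (the comparison itself is ours):

import numpy as np
# OLS and Gaussian-GLM coefficients agree to numerical precision
print(np.allclose(fit_weight.params, fit_weight2.params))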
1.3. Plotting the linear fit
fig, ax = plt.subplots()
ax.scatter(crabs_df["weight"], crabs_df["y"])
line = fit_weight2.params["Intercept"] + fit_weight2.params["weight"] * crabs_df["weight"]
ax.plot(crabs_df["weight"], line, color="#f03b20")
[<matplotlib.lines.Line2D at 0x7f37426eff10>]
Figure: scatter plot of y against weight with the fitted regression line.
1.4. Comparing Mean Numbers of Satellites by Crab Color
crabs_df["color"].value_counts()
2 95
3 44
4 22
1 12
Name: color, dtype: int64
color: 1 = medium light, 2 = medium, 3 = medium dark, 4 = dark
crabs_df.groupby("color").agg(["mean", "var"])[["y"]]
| color | mean(y) | var(y) |
|---|---|---|
| 1 | 4.083333 | 9.719697 |
| 2 | 3.294737 | 10.273908 |
| 3 | 2.227273 | 6.737844 |
| 4 | 2.045455 | 13.093074 |
The majority of the crabs are medium colored, and the mean response decreases as the color gets darker.
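To visualize the spread within each color group, a box plot is a natural companion to the table above (a minimal sketch using the seaborn import from the preamble):

# Distribution of satellite counts within each color category
sns.boxplot(x="color", y="y", data=crabs_df)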
If we fit a linear model between \(y\) and \(color\) using smf.ols, color is treated as a quantitative variable:
mod = smf.ols(formula="y ~ color", data=crabs_df)
res = mod.fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.036
Model: OLS Adj. R-squared: 0.031
Method: Least Squares F-statistic: 6.459
Date: Tue, 12 Jan 2021 Prob (F-statistic): 0.0119
Time: 23:01:17 Log-Likelihood: -440.18
No. Observations: 173 AIC: 884.4
Df Residuals: 171 BIC: 890.7
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 4.7461 0.757 6.274 0.000 3.253 6.239
color -0.7490 0.295 -2.542 0.012 -1.331 -0.167
==============================================================================
Omnibus: 38.876 Durbin-Watson: 1.780
Prob(Omnibus): 0.000 Jarque-Bera (JB): 59.793
Skew: 1.207 Prob(JB): 1.04e-13
Kurtosis: 4.570 Cond. No. 9.39
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Let’s treat color as a qualitative variable:
mod = smf.ols(formula="y ~ C(color)", data=crabs_df)
res = mod.fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.040
Model: OLS Adj. R-squared: 0.023
Method: Least Squares F-statistic: 2.323
Date: Tue, 12 Jan 2021 Prob (F-statistic): 0.0769
Time: 23:01:17 Log-Likelihood: -439.89
No. Observations: 173 AIC: 887.8
Df Residuals: 169 BIC: 900.4
Df Model: 3
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 4.0833 0.899 4.544 0.000 2.310 5.857
C(color)[T.2] -0.7886 0.954 -0.827 0.409 -2.671 1.094
C(color)[T.3] -1.8561 1.014 -1.831 0.069 -3.857 0.145
C(color)[T.4] -2.0379 1.117 -1.824 0.070 -4.243 0.167
==============================================================================
Omnibus: 37.294 Durbin-Watson: 1.779
Prob(Omnibus): 0.000 Jarque-Bera (JB): 55.871
Skew: 1.179 Prob(JB): 7.38e-13
Kurtosis: 4.479 Cond. No. 9.31
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This is equivalent to doing a GLM fit with a Gaussian family and identity link:
formula = """y ~ C(color)"""
response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit_color = sm.GLM(response, predictors, family=sm.families.Gaussian()).fit()
print(fit_color.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 173
Model: GLM Df Residuals: 169
Model Family: Gaussian Df Model: 3
Link Function: identity Scale: 9.6884
Method: IRLS Log-Likelihood: -439.89
Date: Tue, 12 Jan 2021 Deviance: 1637.3
Time: 23:01:18 Pearson chi2: 1.64e+03
No. Iterations: 3
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 4.0833 0.899 4.544 0.000 2.322 5.844
C(color)[T.2] -0.7886 0.954 -0.827 0.408 -2.658 1.080
C(color)[T.3] -1.8561 1.014 -1.831 0.067 -3.843 0.131
C(color)[T.4] -2.0379 1.117 -1.824 0.068 -4.227 0.151
=================================================================================
If we instead do a Poisson fit:
formula = """y ~ C(color)"""
response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit_color2 = sm.GLM(
response, predictors, family=sm.families.Poisson(link=sm.families.links.identity())
).fit()
print(fit_color2.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 173
Model: GLM Df Residuals: 169
Model Family: Poisson Df Model: 3
Link Function: identity Scale: 1.0000
Method: IRLS Log-Likelihood: -482.22
Date: Tue, 12 Jan 2021 Deviance: 609.14
Time: 23:01:18 Pearson chi2: 584.
No. Iterations: 3
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 4.0833 0.583 7.000 0.000 2.940 5.227
C(color)[T.2] -0.7886 0.612 -1.288 0.198 -1.989 0.412
C(color)[T.3] -1.8561 0.625 -2.969 0.003 -3.081 -0.631
C(color)[T.4] -2.0379 0.658 -3.096 0.002 -3.328 -0.748
=================================================================================
And we get the same estimates as when using the Gaussian family with the identity link! This is because, with a single categorical predictor, the ML estimates under a Poisson model are also the per-group sample means. The standard errors, however, are much smaller: the Poisson model lets the variance track the mean (heteroskedastic across groups), while the Gaussian version assumes homoskedasticity.
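We can verify that the identity-link Poisson coefficients reconstruct the per-color sample means (the reconstruction below is ours):

# Intercept is the color-1 mean; the other coefficients are offsets from it
params = fit_color2.params
group_means = params["Intercept"] + pd.Series(
    [0, params["C(color)[T.2]"], params["C(color)[T.3]"], params["C(color)[T.4]"]],
    index=[1, 2, 3, 4],
)
print(group_means)
print(crabs_df.groupby("color")["y"].mean())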
1.5. Using both qualitative and quantitative variables
formula = """y ~ weight + C(color)"""
response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit_weight_color = sm.GLM(response, predictors, family=sm.families.Gaussian()).fit()
print(fit_weight_color.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 173
Model: GLM Df Residuals: 168
Model Family: Gaussian Df Model: 4
Link Function: identity Scale: 8.6370
Method: IRLS Log-Likelihood: -429.44
Date: Tue, 12 Jan 2021 Deviance: 1451.0
Time: 23:01:18 Pearson chi2: 1.45e+03
No. Iterations: 3
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept -0.8232 1.355 -0.608 0.543 -3.479 1.832
C(color)[T.2] -0.6181 0.901 -0.686 0.493 -2.384 1.148
C(color)[T.3] -1.2404 0.966 -1.284 0.199 -3.134 0.653
C(color)[T.4] -1.1882 1.070 -1.110 0.267 -3.286 0.910
weight 1.8662 0.402 4.645 0.000 1.079 2.654
=================================================================================
formula = """y ~ weight + C(color)"""
response, predictors = dmatrices(formula, crabs_df, return_type="dataframe")
fit_weight_color2 = sm.GLM(
response, predictors, family=sm.families.Poisson(link=sm.families.links.identity())
).fit()
print(fit_weight_color2.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 173
Model: GLM Df Residuals: 168
Model Family: Poisson Df Model: 4
Link Function: identity Scale: 1.0000
Method: IRLS Log-Likelihood: nan
Date: Tue, 12 Jan 2021 Deviance: 534.33
Time: 23:01:18 Pearson chi2: 529.
No. Iterations: 100
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept -0.9930 0.736 -1.349 0.177 -2.436 0.450
C(color)[T.2] -0.8442 0.615 -1.374 0.170 -2.049 0.360
C(color)[T.3] -1.4320 0.629 -2.278 0.023 -2.664 -0.200
C(color)[T.4] -1.2248 0.658 -1.861 0.063 -2.515 0.065
weight 2.0086 0.173 11.641 0.000 1.670 2.347
=================================================================================
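Note that this last fit did not converge cleanly: the log-likelihood is reported as nan and the solver ran for 100 iterations. With an identity link, the linear predictor can produce negative fitted means, which is invalid for a Poisson model. A common remedy is the canonical log link; a minimal sketch (not part of the original analysis):

# Refit with the canonical log link, which keeps the fitted means positive
fit_weight_color_log = sm.GLM(response, predictors, family=sm.families.Poisson()).fit()
print(fit_weight_color_log.summary())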