Contents

2. Chapter 2 - Linear Models: Least Squares Theory¶

import warnings

import pandas as pd
import proplot as plot
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrices
from scipy import stats

warnings.filterwarnings("ignore")
%pylab inline


plt.rcParams["axes.labelweight"] = "bold"
plt.rcParams["font.weight"] = "bold"

Populating the interactive namespace from numpy and matplotlib

scots_races_df = pd.read_csv("../data/ScotsRaces.tsv.gz", sep="\t")
scots_races_df.head()

	race	distance	climb	time
1	GreenmantleNewYearDash	2.5	0.65	16.08
2	Carnethy5HillRace	6.0	2.50	48.35
3	CraigDunainHillRace	6.0	0.90	33.65
4	BenRhaHillRace	7.5	0.80	45.60
5	BenLomondHillRace	8.0	3.07	62.27

A dataset contains a list of hill races in Scotland for the year. Explanatory variables:

distance of the race (in miles)
the cumulative climb (in thousands of feet)

print(scots_races_df[["time", "climb", "distance"]].mean(axis=0))
print(scots_races_df[["time", "climb", "distance"]].std(axis=0))

time        56.089714
climb        1.815314
distance     7.528571
dtype: float64
time        50.392617
climb        1.619151
distance     5.523936
dtype: float64

sns.pairplot(scots_races_df[["time", "climb", "distance"]])

<seaborn.axisgrid.PairGrid at 0x7f14f9889510>

../_images/02_Chapter02_6_1.png

scots_races_df[["climb", "distance", "time"]].corr()

	climb	distance	time
climb	1.000000	0.652346	0.832654
distance	0.652346	1.000000	0.943094
time	0.832654	0.943094	1.000000

fit_cd = smf.ols(
    formula="""time ~ climb + distance""",
    data=scots_races_df[["climb", "distance", "time"]],
).fit()
print(fit_cd.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   time   R-squared:                       0.972
Model:                            OLS   Adj. R-squared:                  0.970
Method:                 Least Squares   F-statistic:                     549.9
Date:                Tue, 12 Jan 2021   Prob (F-statistic):           1.67e-25
Time:                        23:01:23   Log-Likelihood:                -123.95
No. Observations:                  35   AIC:                             253.9
Df Residuals:                      32   BIC:                             258.6
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -13.1086      2.561     -5.119      0.000     -18.325      -7.892
climb         11.7801      1.221      9.651      0.000       9.294      14.266
distance       6.3510      0.358     17.751      0.000       5.622       7.080
==============================================================================
Omnibus:                        7.207   Durbin-Watson:                   1.983
Prob(Omnibus):                  0.027   Jarque-Bera (JB):                6.967
Skew:                           0.598   Prob(JB):                       0.0307
Kurtosis:                       4.829   Cond. No.                         16.7
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Thus, adjusted for climb, the predicted record time increased by 6.34 minutes for every additional midle of distance

sns.distplot(fit_cd.resid, kde=False, color="slateblue")

<AxesSubplot:>

../_images/02_Chapter02_10_1.png

pd.Series(fit_cd.resid_pearson).quantile(q=[0, 0.25, 0.5, 0.75, 1])

00   -1.906667
25   -0.554349
50    0.127085
75    0.534341
00    3.178498
dtype: float64

print(stats.pearsonr(fit_cd.fittedvalues, fit_cd.resid))

(3.755818658506266e-15, 0.9999999999999806)

print(fit_cd.resid_pearson.mean(), fit_cd.resid_pearson.std())

6.483702463810914e-15 0.9561828874675148

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(scots_races_df["distance"], fit_cd.resid_pearson)

<matplotlib.collections.PathCollection at 0x7f14f0e6b650>

../_images/02_Chapter02_14_1.png

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(fit_cd.fittedvalues, fit_cd.resid_pearson)

<matplotlib.collections.PathCollection at 0x7f14ee9f5710>

../_images/02_Chapter02_15_1.png

influence = fit_cd.get_influence()
summary_frame = influence.summary_frame()
fig, ax = plt.subplots()
ax.plot(summary_frame["cooks_d"])
ax.set_xlabel("Observation")
ax.set_ylabel("Cook's distance")

Text(0, 0.5, "Cook's distance")

../_images/02_Chapter02_16_1.png

2.1. ANOVA¶

lm = smf.ols(
    formula="""time ~ climb + distance""",
    data=scots_races_df[["climb", "distance", "time"]],
).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)

                sum_sq    df           F        PR(>F)
climb      7105.618848   1.0   93.139930  5.368815e-11
distance  24038.225642   1.0  315.091296  3.990140e-18
Residual   2441.270928  32.0         NaN           NaN

# Change order
lm = smf.ols(
    formula="""time ~ distance + climb""",
    data=scots_races_df[["climb", "distance", "time"]],
).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)

                sum_sq    df           F        PR(>F)
distance  24038.225642   1.0  315.091296  3.990140e-18
climb      7105.618848   1.0   93.139930  5.368815e-11
Residual   2441.270928  32.0         NaN           NaN

lm = smf.ols(
    formula="""time ~ climb + distance + climb:distance """,
    data=scots_races_df[["climb", "distance", "time"]],
).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(lm.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   time   R-squared:                       0.981
Model:                            OLS   Adj. R-squared:                  0.979
Method:                 Least Squares   F-statistic:                     524.1
Date:                Tue, 12 Jan 2021   Prob (F-statistic):           1.24e-26
Time:                        23:01:24   Log-Likelihood:                -117.30
No. Observations:                  35   AIC:                             242.6
Df Residuals:                      31   BIC:                             248.8
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         -0.7672      3.906     -0.196      0.846      -8.733       7.199
climb              3.7133      2.365      1.570      0.126      -1.110       8.536
distance           4.9623      0.474     10.464      0.000       3.995       5.929
climb:distance     0.6598      0.174      3.786      0.001       0.304       1.015
==============================================================================
Omnibus:                        9.175   Durbin-Watson:                   1.509
Prob(Omnibus):                  0.010   Jarque-Bera (JB):               14.024
Skew:                          -0.486   Prob(JB):                     0.000901
Kurtosis:                       5.945   Cond. No.                         128.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

1. Chapter 1 - Introduction to Linear and Generalized Linear Models 3. Chapter 3 - Normal Linear Models: Statistical Inference