Open In Colab

3. Chapter 3 - Normal Linear Models: Statistical Inference

import warnings

import pandas as pd
import proplot as plot
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrices
from scipy import stats

warnings.filterwarnings("ignore")
%pylab inline


plt.rcParams["axes.labelweight"] = "bold"
plt.rcParams["font.weight"] = "bold"
Populating the interactive namespace from numpy and matplotlib
stats.f.ppf(0.95, 2, 27)  # 0.95 quantile of F dist. with df1=2, df2 =27
3.3541308285291986
1 - stats.f.pdf(3.364131, 2, 27)  # right tail probability of non-central F
0.9602934531255635
house_df = pd.read_csv("../data/Houses.tsv.gz", sep="\t")
house_df
case taxes beds baths new price size
0 1 3104 4 2 0 279.9 2048
1 2 1173 2 1 0 146.5 912
2 3 3076 4 2 0 237.7 1654
3 4 1608 3 2 0 200.0 2068
4 5 1454 3 3 0 159.9 1477
... ... ... ... ... ... ... ...
95 96 990 2 2 0 176.0 1060
96 97 3030 3 2 0 196.5 1730
97 98 1580 3 2 0 132.2 1370
98 99 1770 3 2 0 88.4 1560
99 100 1430 3 2 0 127.2 1340

100 rows × 7 columns

Data Description: Observations on recent homesales in Gainesville, Florida.

pd.concat([house_df.mean(), house_df.std()], axis=1).rename(
    columns={0: "mean", 1: "sd"}
)
mean sd
case 50.500 29.011492
taxes 1908.390 1235.825663
beds 3.000 0.651339
baths 1.960 0.567112
new 0.110 0.314466
price 155.331 101.262213
size 1629.280 666.941702
fit = smf.ols(formula="""price ~ size + new""", data=house_df).fit()
print(fit.summary())
model_residuals = fit.resid
sd_model_residuals = model_residuals.std()
print("Residual SE: {}".format(np.round(sd_model_residuals, 2)))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.723
Model:                            OLS   Adj. R-squared:                  0.717
Method:                 Least Squares   F-statistic:                     126.3
Date:                Tue, 12 Jan 2021   Prob (F-statistic):           9.79e-28
Time:                        23:01:27   Log-Likelihood:                -539.05
No. Observations:                 100   AIC:                             1084.
Df Residuals:                      97   BIC:                             1092.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -40.2309     14.696     -2.738      0.007     -69.399     -11.063
size           0.1161      0.009     13.204      0.000       0.099       0.134
new           57.7363     18.653      3.095      0.003      20.715      94.757
==============================================================================
Omnibus:                       12.906   Durbin-Watson:                   1.483
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               29.895
Skew:                           0.370   Prob(JB):                     3.22e-07
Kurtosis:                       5.574   Cond. No.                     6.32e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.32e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Residual SE: 53.33

Interpretation:

- $H_0: \beta_1 = \beta_2 = 0$ The F-statistic with df1 = 2, df2 = 100-3 = 97 is  126.3 with a low p-value which is not surprising since the null is too strict. In some sense the probability of both the coefficients of size and new being 0 is low.

- Coeff of new: Having "adjusted" for size the coefficient of new is still significant

95%CI for the mean selling price of new homes at the mean size of the new homes, 2354.73 square feet:

prediction = fit.get_prediction({"size": 2354.73, "new": 1})
prediction.summary_frame()
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower obs_ci_upper
0 290.963953 16.245718 258.720701 323.207205 179.270051 402.657856

Interpretation: If the model truly holds, the ‘mean_ci_lower’ and ‘mean_ci_upper’ reflects the 95% confidence interval where the population mean will lie. While ‘obs_ci_lower’ and ‘obs_ci_upper’ reflects the 95% CI where the future (predicted) will lie.

3.1. TODO: Derivation of the CI for mean and future y

Also see CrossValidated

fig, ax = plt.subplots()
ax.scatter(fit.fittedvalues, fit.get_influence().resid_studentized_internal)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Studentized Residual")
ax.set_title("Figure 3.4")
fig.tight_layout()
../_images/03_Chapter03_14_0.png
fig, ax = plt.subplots()
ax.scatter(house_df["size"], fit.get_influence().resid_studentized_internal)
ax.set_xlabel("Size")
ax.set_ylabel("Studentized Residual")
fig.tight_layout()
../_images/03_Chapter03_15_0.png
cooks_distance = fit.get_influence().cooks_distance[0]
plt.plot(cooks_distance)
[<matplotlib.lines.Line2D at 0x7fe63ac0eb50>]
../_images/03_Chapter03_16_1.png

Observation 64 seems influential. Let’s remove it and refit.

fit = smf.ols(
    formula="""price ~ size + new""", data=house_df.loc[cooks_distance < 1]
).fit()
print(fit.summary())
model_residuals = fit.resid
sd_model_residuals = model_residuals.std()
print("Residual SE: {}".format(np.round(sd_model_residuals, 2)))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.772
Model:                            OLS   Adj. R-squared:                  0.767
Method:                 Least Squares   F-statistic:                     162.5
Date:                Tue, 12 Jan 2021   Prob (F-statistic):           1.52e-31
Time:                        23:01:28   Log-Likelihood:                -524.21
No. Observations:                  99   AIC:                             1054.
Df Residuals:                      96   BIC:                             1062.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -63.1545     14.252     -4.431      0.000     -91.444     -34.865
size           0.1328      0.009     15.138      0.000       0.115       0.150
new           41.3062     17.327      2.384      0.019       6.913      75.700
==============================================================================
Omnibus:                        8.263   Durbin-Watson:                   1.480
Prob(Omnibus):                  0.016   Jarque-Bera (JB):                7.876
Skew:                           0.640   Prob(JB):                       0.0195
Kurtosis:                       3.520   Cond. No.                     6.42e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.42e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Residual SE: 48.48

\(R^2\) sees a considerable improvement. Next we check for an interaction:

fit = smf.ols(formula="""price ~ size + new + size:new""", data=house_df.loc[cooks_distance<1]).fit()
print(fit.summary())
model_residuals = fit.resid
sd_model_residuals = model_residuals.std()
print("Residual SE: {}".format(np.round(sd_model_residuals, 2)))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.782
Model:                            OLS   Adj. R-squared:                  0.775
Method:                 Least Squares   F-statistic:                     113.7
Date:                Tue, 12 Jan 2021   Prob (F-statistic):           2.52e-31
Time:                        23:01:28   Log-Likelihood:                -521.95
No. Observations:                  99   AIC:                             1052.
Df Residuals:                      95   BIC:                             1062.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -48.2431     15.686     -3.075      0.003     -79.385     -17.102
size           0.1230      0.010     12.536      0.000       0.104       0.142
new          -52.5122     47.630     -1.102      0.273    -147.070      42.046
size:new       0.0434      0.021      2.109      0.038       0.003       0.084
==============================================================================
Omnibus:                       14.139   Durbin-Watson:                   1.508
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               16.134
Skew:                           0.807   Prob(JB):                     0.000314
Kurtosis:                       4.144   Cond. No.                     1.76e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.76e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Residual SE: 47.39

The increase in \(R^2\) is minimal.

3.2. Simpson’s paradox

Consider adding beds as another explanatory variable.

stats.pearsonr(house_df['beds'], house_df['price'])
(0.39395702205835825, 5.006768706208423e-05)

Thus, beds is modestly correlated with price. Let’s include it in the model and see how the fit changes.

fit = smf.ols(formula="""price ~ size + new""", data=house_df).fit()
print(fit.summary())
model_residuals = fit.resid
sd_model_residuals = model_residuals.std()
print("Residual SE: {}".format(np.round(sd_model_residuals, 2)))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.723
Model:                            OLS   Adj. R-squared:                  0.717
Method:                 Least Squares   F-statistic:                     126.3
Date:                Tue, 12 Jan 2021   Prob (F-statistic):           9.79e-28
Time:                        23:01:28   Log-Likelihood:                -539.05
No. Observations:                 100   AIC:                             1084.
Df Residuals:                      97   BIC:                             1092.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -40.2309     14.696     -2.738      0.007     -69.399     -11.063
size           0.1161      0.009     13.204      0.000       0.099       0.134
new           57.7363     18.653      3.095      0.003      20.715      94.757
==============================================================================
Omnibus:                       12.906   Durbin-Watson:                   1.483
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               29.895
Skew:                           0.370   Prob(JB):                     3.22e-07
Kurtosis:                       5.574   Cond. No.                     6.32e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.32e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Residual SE: 53.33
fit = smf.ols(formula="""price ~ size + new + beds""", data=house_df).fit()
print(fit.summary())
model_residuals = fit.resid
sd_model_residuals = model_residuals.std()
print("Residual SE: {}".format(np.round(sd_model_residuals, 2)))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.724
Model:                            OLS   Adj. R-squared:                  0.715
Method:                 Least Squares   F-statistic:                     83.97
Date:                Tue, 12 Jan 2021   Prob (F-statistic):           9.68e-27
Time:                        23:01:28   Log-Likelihood:                -538.78
No. Observations:                 100   AIC:                             1086.
Df Residuals:                      96   BIC:                             1096.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -25.1998     25.602     -0.984      0.327     -76.020      25.620
size           0.1205      0.011     11.229      0.000       0.099       0.142
new           54.8996     19.113      2.872      0.005      16.961      92.838
beds          -7.2927     10.159     -0.718      0.475     -27.458      12.872
==============================================================================
Omnibus:                       13.929   Durbin-Watson:                   1.496
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               41.568
Skew:                           0.277   Prob(JB):                     9.41e-10
Kurtosis:                       6.110   Cond. No.                     8.80e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.8e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Residual SE: 53.19

Including beds leads to a reduced \(R^2\)!! Thus, once size and new are in the model, it does not help to add beds. Although the marginal effect of beds is positive (as shown by the modest positive correlation), the estimated partial effect of beds is negative. This illustrates Simpson’s paradox.

3.3. Partial Correlation

regress_sizenew_on_price = smf.ols(formula="""price ~ size + new""", data=house_df).fit()
regress_sizenew_on_bed = smf.ols(formula="""beds ~ size + new""", data=house_df).fit()

stats.pearsonr(regress_sizenew_on_price.resid, regress_sizenew_on_bed.resid)
(-0.07307200731424245, 0.4699872574144025)

Partial correlation between selling price and beds. Found using 1) finding the residuals for predicting selling price using size and new 2) finding the residual for predicting beds using size and new and 3) correlating the residuals in 1 and 2.

3.4. TODO: Testing contrats as a general linear hypothesis

from statsmodels.stats.libqsturng import psturng, qsturng

qsturng(0.95, 10, 190)
4.52786052558441
stats.t.ppf(1-0.05/(2*45), 190)
3.3113791139604563