Chapter 4 Linear regression: before and after fitting the model

import warnings

import pandas as pd
import proplot as plot
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

warnings.filterwarnings("ignore")
%pylab inline


plt.rcParams["axes.labelweight"] = "bold"
plt.rcParams["font.weight"] = "bold"
Populating the interactive namespace from numpy and matplotlib
earnings_df = pd.read_csv("../data/earnings/heights.tsv.gz", sep="\t")
earnings_df.head()
      earn  height1  height2  sex  race  hisp  ed  yearbn  height
0      NaN      5.0      6.0    2     1     2  12      53    66.0
1      NaN      5.0      4.0    1     2     2  12      50    64.0
2  50000.0      6.0      2.0    1     1     2  16      45    74.0
3  60000.0      5.0      6.0    2     1     2  16      32    66.0
4  30000.0      5.0      4.0    2     1     2  16      61    64.0
earnings_df.shape
(2029, 9)
earnings_df.dropna().shape
(1379, 9)
earnings_df = earnings_df.dropna()
earnings_df.shape
(1379, 9)
earnings_df.head()
      earn  height1  height2  sex  race  hisp  ed  yearbn  height
2  50000.0      6.0      2.0    1     1     2  16      45    74.0
3  60000.0      5.0      6.0    2     1     2  16      32    66.0
4  30000.0      5.0      4.0    2     1     2  16      61    64.0
6  50000.0      5.0      3.0    2     3     2  16      99    63.0
8  51000.0      5.0      3.0    2     1     2  17      51    63.0

Centering and standardizing, especially for models with interactions

kidiq_df = pd.read_csv("../data/kidiq.tsv.gz", sep="\t")
kidiq_df.head()
   kid_score  mom_hs      mom_iq  mom_work  mom_age
0         65     1.0  121.117529         4       27
1         98     1.0   89.361882         4       25
2         85     1.0  115.443165         4       27
3         83     1.0   99.449639         3       25
4        115     1.0   92.745710         4       27
model = smf.ols(
    formula="""kid_score ~ mom_hs + mom_iq + mom_hs:mom_iq""", data=kidiq_df
)
results_hs = model.fit()
print(results_hs.summary())
print("Residual SD: {}".format(np.round(results_hs.resid.std(), 2)))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              kid_score   R-squared:                       0.230
Model:                            OLS   Adj. R-squared:                  0.225
Method:                 Least Squares   F-statistic:                     42.84
Date:                Sat, 20 Jun 2020   Prob (F-statistic):           3.07e-24
Time:                        23:35:23   Log-Likelihood:                -1867.5
No. Observations:                 434   AIC:                             3743.
Df Residuals:                     430   BIC:                             3759.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept       -11.4820     13.758     -0.835      0.404     -38.523      15.559
mom_hs           51.2682     15.338      3.343      0.001      21.122      81.414
mom_iq            0.9689      0.148      6.531      0.000       0.677       1.260
mom_hs:mom_iq    -0.4843      0.162     -2.985      0.003      -0.803      -0.165
==============================================================================
Omnibus:                        8.014   Durbin-Watson:                   1.660
Prob(Omnibus):                  0.018   Jarque-Bera (JB):                8.258
Skew:                          -0.333   Prob(JB):                       0.0161
Kurtosis:                       2.887   Cond. No.                     3.10e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.1e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Residual SD: 17.91

Centering by subtracting the mean

kidiq_df["mom_hs_centered"] = kidiq_df["mom_hs"] - kidiq_df["mom_hs"].mean()
kidiq_df["mom_iq_centered"] = kidiq_df["mom_iq"] - kidiq_df["mom_iq"].mean()


model = smf.ols(
    formula="""kid_score ~ mom_hs_centered + mom_iq_centered + mom_hs_centered:mom_iq_centered""",
    data=kidiq_df,
)
results_hs = model.fit()
print(results_hs.summary())
print("Residual SD: {}".format(np.round(results_hs.resid.std(), 2)))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              kid_score   R-squared:                       0.230
Model:                            OLS   Adj. R-squared:                  0.225
Method:                 Least Squares   F-statistic:                     42.84
Date:                Sat, 20 Jun 2020   Prob (F-statistic):           3.07e-24
Time:                        23:35:23   Log-Likelihood:                -1867.5
No. Observations:                 434   AIC:                             3743.
Df Residuals:                     430   BIC:                             3759.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
===================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
Intercept                          87.6389      0.908     96.565      0.000      85.855      89.423
mom_hs_centered                     2.8408      2.427      1.171      0.242      -1.929       7.610
mom_iq_centered                     0.5884      0.061      9.712      0.000       0.469       0.707
mom_hs_centered:mom_iq_centered    -0.4843      0.162     -2.985      0.003      -0.803      -0.165
==============================================================================
Omnibus:                        8.014   Durbin-Watson:                   1.660
Prob(Omnibus):                  0.018   Jarque-Bera (JB):                8.258
Skew:                          -0.333   Prob(JB):                       0.0161
Kurtosis:                       2.887   Cond. No.                         42.2
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Residual SD: 17.91

The residual SD and \(R^2\) do not change: a linear transformation of the inputs does not affect the fit, only the interpretation of the coefficients:

- Intercept (87.6): predicted score for children whose mothers have the average IQ and the average value of mom_hs.
- Coefficient of mom_hs_centered (2.8): predicted difference in score comparing children whose mothers did and did not complete high school, for mothers with average IQ.
- Coefficient of mom_iq_centered (0.59): predicted difference in score per 1-point difference in mother's IQ, at the average value of mom_hs.
- Interaction (-0.48): difference in the mom_iq slope comparing children whose mothers did and did not complete high school; it is unchanged from the uncentered fit.

After centering, the coefficient of each main effect corresponds to the predictive difference with the other input held at its average value, as the quick check below verifies.
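The centered main effect can be recovered from the uncentered fit: the mom_iq_centered coefficient equals the uncentered mom_iq coefficient plus the interaction coefficient times the mean of mom_hs. A minimal sketch, reusing the coefficients printed above and assuming the same data path as the earlier cell:

import pandas as pd

kidiq_df = pd.read_csv("../data/kidiq.tsv.gz", sep="\t")

b_iq, b_interaction = 0.9689, -0.4843  # from the uncentered fit above

# mom_iq slope with mom_hs held at its mean: matches the
# mom_iq_centered coefficient (0.5884) from the centered fit
print(b_iq + b_interaction * kidiq_df["mom_hs"].mean())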

Using a conventional centering point

If we center on an understandable reference point (mom_hs = 0.5 and mom_iq = 100), the coefficients become even more directly interpretable.

kidiq_df["mom_hs_centered2"] = kidiq_df["mom_hs"] - 0.5
kidiq_df["mom_iq_centered2"] = kidiq_df["mom_iq"] - 100


model = smf.ols(
    formula="""kid_score ~ mom_hs_centered2 + mom_iq_centered2 + mom_hs_centered2:mom_iq_centered2""",
    data=kidiq_df,
)
results_hs = model.fit()
print(results_hs.summary())
print("Residual SD: {}".format(np.round(results_hs.resid.std(), 2)))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              kid_score   R-squared:                       0.230
Model:                            OLS   Adj. R-squared:                  0.225
Method:                 Least Squares   F-statistic:                     42.84
Date:                Sat, 20 Jun 2020   Prob (F-statistic):           3.07e-24
Time:                        23:35:23   Log-Likelihood:                -1867.5
No. Observations:                 434   AIC:                             3743.
Df Residuals:                     430   BIC:                             3759.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=====================================================================================================
                                        coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
Intercept                            86.8273      1.213     71.561      0.000      84.442      89.212
mom_hs_centered2                      2.8408      2.427      1.171      0.242      -1.929       7.610
mom_iq_centered2                      0.7268      0.081      8.960      0.000       0.567       0.886
mom_hs_centered2:mom_iq_centered2    -0.4843      0.162     -2.985      0.003      -0.803      -0.165
==============================================================================
Omnibus:                        8.014   Durbin-Watson:                   1.660
Prob(Omnibus):                  0.018   Jarque-Bera (JB):                8.258
Skew:                          -0.333   Prob(JB):                       0.0161
Kurtosis:                       2.887   Cond. No.                         46.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Residual SD: 17.91
- Coefficient of mom_hs_centered2 (2.8): average predictive difference between a child with mom_hs = 1 and mom_hs = 0, for children whose mothers have mom_iq = 100.
- Coefficient of mom_iq_centered2 (0.73): average predictive difference in kid_score per point of mother's IQ, evaluated at mom_hs = 0.5 (see the check below).
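Again the main effects line up with the uncentered fit evaluated at the reference point; a minimal check, reusing the coefficients printed in the uncentered summary:

b_iq, b_interaction = 0.9689, -0.4843  # from the uncentered fit

# mom_iq slope with mom_hs held at the reference value 0.5:
# matches the mom_iq_centered2 coefficient (0.7268)
print(b_iq + b_interaction * 0.5)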

Standardizing by subtracting the mean and dividing by 2 standard deviations

kidiq_df["mom_hs_centeredZ"] = (
    (kidiq_df["mom_hs"] - kidiq_df["mom_hs"].mean()) / 2 * kidiq_df["mom_hs"].std()
)
kidiq_df["mom_iq_centeredZ"] = (
    (kidiq_df["mom_iq"] - kidiq_df["mom_iq"].mean()) / 2 * kidiq_df["mom_iq"].std()
)


model = smf.ols(
    formula="""kid_score ~ mom_hs_centeredZ + mom_iq_centeredZ + mom_hs_centeredZ:mom_iq_centeredZ""",
    data=kidiq_df,
)
results_hs = model.fit()
print(results_hs.summary())
print("Residual SD: {}".format(np.round(results_hs.resid.std(), 2)))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              kid_score   R-squared:                       0.230
Model:                            OLS   Adj. R-squared:                  0.225
Method:                 Least Squares   F-statistic:                     42.84
Date:                Sat, 20 Jun 2020   Prob (F-statistic):           3.07e-24
Time:                        23:35:23   Log-Likelihood:                -1867.5
No. Observations:                 434   AIC:                             3743.
Df Residuals:                     430   BIC:                             3759.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=====================================================================================================
                                        coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
Intercept                            87.6389      0.908     96.565      0.000      85.855      89.423
mom_hs_centeredZ                     13.8304     11.814      1.171      0.242      -9.391      37.051
mom_iq_centeredZ                      0.0785      0.008      9.712      0.000       0.063       0.094
mom_hs_centeredZ:mom_iq_centeredZ    -0.3144      0.105     -2.985      0.003      -0.521      -0.107
==============================================================================
Omnibus:                        8.014   Durbin-Watson:                   1.660
Prob(Omnibus):                  0.018   Jarque-Bera (JB):                8.258
Skew:                          -0.333   Prob(JB):                       0.0161
Kurtosis:                       2.887   Cond. No.                     1.54e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.54e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Residual SD: 17.91

Why scale by 2 standard deviations?

If \(x\) is a binary variable taking the values 0 and 1 with equal probability, its standard deviation is \(\sqrt{0.5 \times 0.5} = 0.5\). Dividing by one standard deviation would map the two values to \(-1\) and \(+1\), a difference of 2, so the coefficient would correspond to only half the difference between the two groups. Dividing by two standard deviations maps them to \(-0.5\) and \(+0.5\), a difference of 1, so the coefficient is directly comparable to that of an untransformed binary predictor.
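A minimal numeric sketch of this argument:

import numpy as np

x = np.array([0, 1] * 500)  # balanced binary predictor
print(x.std())  # population SD = sqrt(0.5 * 0.5) = 0.5

z1 = (x - x.mean()) / x.std()        # maps {0, 1} to {-1, +1}: a gap of 2
z2 = (x - x.mean()) / (2 * x.std())  # maps {0, 1} to {-0.5, +0.5}: a gap of 1
print(sorted(set(z1)), sorted(set(z2)))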

Logarithmic transformations

A linear model on the logarithmic scale corresponds to a multiplicative model on the original scale.
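In symbols: if \(\log y = a + bx + \epsilon\), then exponentiating both sides gives \(y = e^{a + bx + \epsilon} = e^a \cdot (e^b)^x \cdot e^\epsilon\), so a one-unit difference in \(x\) multiplies \(y\) by a factor of \(e^b\).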

Height and earnings example

earnings_df["log_earn"] = np.log(earnings_df["earn"])
earnings_df = earnings_df.replace([np.inf, -np.inf], np.nan)

earnings_df = earnings_df.dropna()
earn_log_model = smf.ols(formula="""log_earn ~ height""", data=earnings_df).fit()
print(earn_log_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               log_earn   R-squared:                       0.060
Model:                            OLS   Adj. R-squared:                  0.060
Method:                 Least Squares   F-statistic:                     76.44
Date:                Sat, 20 Jun 2020   Prob (F-statistic):           7.62e-18
Time:                        23:35:23   Log-Likelihood:                -1555.6
No. Observations:                1192   AIC:                             3115.
Df Residuals:                    1190   BIC:                             3125.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.7785      0.451     12.815      0.000       4.894       6.663
height         0.0588      0.007      8.743      0.000       0.046       0.072
==============================================================================
Omnibus:                      223.552   Durbin-Watson:                   1.904
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              457.878
Skew:                          -1.079   Prob(JB):                    3.74e-100
Kurtosis:                       5.136   Cond. No.                     1.17e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.17e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Interpretation

A coefficient of 0.06 for height on the log scale implies that a 1-inch difference in height multiplies expected earnings by \(\exp(0.06) \approx 1.06\); that is, each additional inch of height corresponds to roughly a 6% increase in earnings.
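To see the multiplicative effect numerically, a small sketch reusing the fitted intercept and slope from the summary above:

import numpy as np

a, b = 5.7785, 0.0588  # intercept and height coefficient from the fit

# Predictions one inch apart differ by a factor of exp(b)
pred_66, pred_67 = np.exp(a + b * 66), np.exp(a + b * 67)
print(pred_67 / pred_66, np.exp(b))  # both ~1.061, i.e. about a 6% difference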

Why use a natural log instead of \(\log_{10}\)?

Coefficients on the natural-log scale are directly interpretable: a small coefficient such as \(0.06\) corresponds approximately to a 6% change. A change on the \(\log_{10}\) scale requires an extra conversion and is harder to interpret:

earnings_df["log10_earn"] = np.log10(earnings_df["earn"])
earnings_df = earnings_df.replace([np.inf, -np.inf], np.nan)

earnings_df = earnings_df.dropna()
earn_log_model = smf.ols(formula="""log10_earn ~ height""", data=earnings_df).fit()
print(earn_log_model.summary())
print("Residual SD: {}".format(np.round(earn_log_model.resid.std(), 2)))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             log10_earn   R-squared:                       0.060
Model:                            OLS   Adj. R-squared:                  0.060
Method:                 Least Squares   F-statistic:                     76.44
Date:                Sat, 20 Jun 2020   Prob (F-statistic):           7.62e-18
Time:                        23:35:23   Log-Likelihood:                -561.42
No. Observations:                1192   AIC:                             1127.
Df Residuals:                    1190   BIC:                             1137.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.5096      0.196     12.815      0.000       2.125       2.894
height         0.0255      0.003      8.743      0.000       0.020       0.031
==============================================================================
Omnibus:                      223.552   Durbin-Watson:                   1.904
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              457.878
Skew:                          -1.079   Prob(JB):                    3.74e-100
Kurtosis:                       5.136   Cond. No.                     1.17e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.17e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Residual SD: 0.39

A coefficient of \(0.0255\) is less interpretable here: it corresponds to the same multiplicative change, \(10^{0.0255} \approx 1.06\), but it cannot be read off directly as a percentage the way a small natural-log coefficient can.
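The two fits are the same model expressed in different bases; multiplying the \(\log_{10}\) coefficient by \(\ln 10\) recovers the natural-log coefficient:

import numpy as np

print(0.0255 * np.log(10))  # ~0.0587: the natural-log height coefficient
print(10 ** 0.0255)         # ~1.060: same multiplicative effect as exp(0.0588)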

Building a regression model on the log scale

earnings_df["male"] = 1
# Set rows where sex==2 to female (0)
earnings_df.loc[earnings_df.sex == 2, "male"] = 0
earn_log_model = smf.ols(formula="""log_earn ~ height + male""", data=earnings_df).fit()
print(earn_log_model.summary())
print("Residual SD: {}".format(np.round(earn_log_model.resid.std(), 2)))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               log_earn   R-squared:                       0.087
Model:                            OLS   Adj. R-squared:                  0.085
Method:                 Least Squares   F-statistic:                     56.34
Date:                Sat, 20 Jun 2020   Prob (F-statistic):           4.21e-24
Time:                        23:35:23   Log-Likelihood:                -1538.7
No. Observations:                1192   AIC:                             3083.
Df Residuals:                    1189   BIC:                             3099.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      8.1527      0.603     13.530      0.000       6.970       9.335
height         0.0207      0.009      2.218      0.027       0.002       0.039
male           0.4232      0.072      5.841      0.000       0.281       0.565
==============================================================================
Omnibus:                      225.577   Durbin-Watson:                   1.889
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              462.608
Skew:                          -1.088   Prob(JB):                    3.52e-101
Kurtosis:                       5.141   Cond. No.                     1.59e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.59e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Residual SD: 0.88

Including an interaction

earnings_df["male"] = 1
# Set rows where sex==2 to female (0)
earnings_df.loc[earnings_df.sex == 2, "male"] = 0
earn_log_model3 = smf.ols(
    formula="""log_earn ~ height + male + height:male""", data=earnings_df
).fit()
print(earn_log_model3.summary())
print("Residual SD: {}".format(np.round(earn_log_model3.resid.std(), 2)))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               log_earn   R-squared:                       0.087
Model:                            OLS   Adj. R-squared:                  0.084
Method:                 Least Squares   F-statistic:                     37.58
Date:                Sat, 20 Jun 2020   Prob (F-statistic):           3.32e-23
Time:                        23:35:23   Log-Likelihood:                -1538.6
No. Observations:                1192   AIC:                             3085.
Df Residuals:                    1188   BIC:                             3106.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       8.3885      0.844      9.945      0.000       6.734      10.043
height          0.0170      0.013      1.304      0.193      -0.009       0.043
male           -0.0786      1.258     -0.062      0.950      -2.546       2.389
height:male     0.0074      0.019      0.400      0.690      -0.029       0.044
==============================================================================
Omnibus:                      225.961   Durbin-Watson:                   1.889
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              463.655
Skew:                          -1.089   Prob(JB):                    2.08e-101
Kurtosis:                       5.143   Cond. No.                     4.16e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.16e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Residual SD: 0.88

Interpretation:

- Intercept (8.39): predicted log earnings when height and male are both zero; since height is never zero, this is not directly interpretable.
- Coefficient of height (0.017): predicted difference in log earnings per 1-inch difference in height when male = 0, i.e., an estimated predictive difference of about 1.7% per inch for women.
- Coefficient of male (-0.079): predicted difference in log earnings between men and women of height zero; again not directly interpretable.
- Interaction height:male (0.0074): difference in the height slope comparing men to women. An extra inch of height corresponds to about 0.7% more of an increase in earnings among men than among women, so the estimated slope for men is 1.7% + 0.7% = 2.4% per inch, as computed below.
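Putting the pieces together, a small sketch with the coefficients from the summary above:

b_height, b_male, b_interaction = 0.0170, -0.0786, 0.0074

slope_women = b_height                 # ~1.7% per inch of height
slope_men = b_height + b_interaction   # ~2.4% per inch of height
print(slope_women, slope_men)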

Linear transformation to make coefficients more interpretable:

earnings_df["height_z"] = (
    earnings_df["height"] - earnings_df["height"].mean()
) / earnings_df["height"].std()
earn_log_model4 = smf.ols(
    formula="""log_earn ~ height_z + male + height_z:male""", data=earnings_df
).fit()
print(earn_log_model4.summary())
print("Residual SD: {}".format(np.round(earn_log_model4.resid.std(), 2)))
print(earnings_df["height"].mean(), earnings_df["height"].std())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               log_earn   R-squared:                       0.087
Model:                            OLS   Adj. R-squared:                  0.084
Method:                 Least Squares   F-statistic:                     37.58
Date:                Sat, 20 Jun 2020   Prob (F-statistic):           3.32e-23
Time:                        23:35:23   Log-Likelihood:                -1538.6
No. Observations:                1192   AIC:                             3085.
Df Residuals:                    1188   BIC:                             3106.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         9.5266      0.045    210.878      0.000       9.438       9.615
height_z          0.0654      0.050      1.304      0.193      -0.033       0.164
male              0.4197      0.073      5.748      0.000       0.276       0.563
height_z:male     0.0286      0.072      0.400      0.690      -0.112       0.169
==============================================================================
Omnibus:                      225.961   Durbin-Watson:                   1.889
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              463.655
Skew:                          -1.089   Prob(JB):                    2.08e-101
Kurtosis:                       5.143   Cond. No.                         4.64
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Residual SD: 0.88
66.91694630872483 3.846637848956869

Interpretation:

- Intercept (9.53): predicted log earnings for a woman (male = 0) of average height (66.9 inches); unlike in the uncentered fit, this is directly interpretable.
- Coefficient of height_z (0.065): predicted difference in log earnings for a one-standard-deviation (3.8-inch) difference in height among women, i.e., about 6.5%.
- Coefficient of male (0.42): predicted difference in log earnings between men and women of average height, a multiplicative factor of \(\exp(0.42) \approx 1.52\) in earnings.
- Interaction height_z:male (0.029): a one-standard-deviation difference in height corresponds to about 2.9% more of an increase in earnings among men than among women, so the slope for men is 0.065 + 0.029 = 0.094 per standard deviation of height; the check below converts this back to the per-inch scale.
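The standardized coefficients are consistent with the per-inch fit; dividing by the height standard deviation (3.85 inches, printed above) recovers the earlier slopes:

sd_height = 3.8466  # SD of height, printed above

slope_women_z, interaction_z = 0.0654, 0.0286
print(slope_women_z / sd_height)                    # ~0.017 per inch, as before
print((slope_women_z + interaction_z) / sd_height)  # ~0.024 per inch for men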

Using discrete rather than continuous predictors

kid_score_momwork = smf.ols(formula="""kid_score ~ C(mom_work)""", data=kidiq_df).fit()
print(kid_score_momwork.summary())
print("Residual SD: {}".format(np.round(kid_score_momwork.resid.std(), 2)))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              kid_score   R-squared:                       0.024
Model:                            OLS   Adj. R-squared:                  0.018
Method:                 Least Squares   F-statistic:                     3.590
Date:                Sat, 20 Jun 2020   Prob (F-statistic):             0.0138
Time:                        23:35:24   Log-Likelihood:                -1918.9
No. Observations:                 434   AIC:                             3846.
Df Residuals:                     430   BIC:                             3862.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept           82.0000      2.305     35.568      0.000      77.469      86.531
C(mom_work)[T.2]     3.8542      3.095      1.245      0.214      -2.229       9.937
C(mom_work)[T.3]    11.5000      3.553      3.237      0.001       4.517      18.483
C(mom_work)[T.4]     5.2098      2.704      1.927      0.055      -0.105      10.524
==============================================================================
Omnibus:                       12.371   Durbin-Watson:                   1.477
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               12.950
Skew:                          -0.411   Prob(JB):                      0.00154
Kurtosis:                       2.796   Cond. No.                         5.92
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Residual SD: 20.16

Building regression models for prediction

mesquite_df = pd.read_csv("../data/mesquite/mesquite.tsv.gz", sep="\t")
mesquite_df.head()
   Obs Group  Diam1  Diam2  TotHt  CanHt  Dens  LeafWt
0    1   MCD    1.8   1.15   1.30   1.00     1   401.3
1    2   MCD    1.7   1.35   1.35   1.33     1   513.7
2    3   MCD    2.8   2.55   2.16   0.60     1  1179.2
3    4   MCD    1.3   0.85   1.80   1.20     1   308.0
4    5   MCD    3.3   1.90   1.55   1.05     1   855.2
formula = """LeafWt ~ Diam1 + Diam2 + CanHt + TotHt + Dens + Group"""
mesquite_model = smf.ols(formula=formula, data=mesquite_df).fit()
print(mesquite_model.summary())
print("Residual SD: {}".format(np.round(mesquite_model.resid.std(), 2)))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 LeafWt   R-squared:                       0.848
Model:                            OLS   Adj. R-squared:                  0.825
Method:                 Least Squares   F-statistic:                     36.34
Date:                Sat, 20 Jun 2020   Prob (F-statistic):           1.73e-14
Time:                        23:35:24   Log-Likelihood:                -318.82
No. Observations:                  46   AIC:                             651.6
Df Residuals:                      39   BIC:                             664.4
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept    -1091.8880    176.456     -6.188      0.000   -1448.804    -734.972
Group[T.MCD]   363.2951    100.184      3.626      0.001     160.654     565.936
Diam1          189.6690    112.760      1.682      0.101     -38.410     417.748
Diam2          371.4621    124.378      2.987      0.005     119.883     623.041
CanHt          355.6653    209.843      1.695      0.098     -68.782     780.113
TotHt         -101.7325    185.574     -0.548      0.587    -477.091     273.626
Dens           131.2542     34.355      3.820      0.000      61.764     200.744
==============================================================================
Omnibus:                        9.865   Durbin-Watson:                   2.137
Prob(Omnibus):                  0.007   Jarque-Bera (JB):               19.723
Skew:                           0.366   Prob(JB):                     5.21e-05
Kurtosis:                       6.123   Cond. No.                         26.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Residual SD: 250.39

Multiplicative Model

formula = """log(LeafWt) ~ log(Diam1) + log(Diam2) + log(CanHt) + log(TotHt) + log(Dens) + Group"""
mesquite_model = smf.ols(formula=formula, data=mesquite_df).fit()
print(mesquite_model.summary())
print("Residual SD: {}".format(np.round(mesquite_model.resid.std(), 2)))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            log(LeafWt)   R-squared:                       0.887
Model:                            OLS   Adj. R-squared:                  0.870
Method:                 Least Squares   F-statistic:                     51.17
Date:                Sat, 20 Jun 2020   Prob (F-statistic):           5.73e-17
Time:                        23:35:24   Log-Likelihood:                -10.406
No. Observations:                  46   AIC:                             34.81
Df Residuals:                      39   BIC:                             47.61
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        4.7680      0.155     30.747      0.000       4.454       5.082
Group[T.MCD]     0.5834      0.129      4.534      0.000       0.323       0.844
log(Diam1)       0.3938      0.282      1.397      0.170      -0.177       0.964
log(Diam2)       1.1512      0.210      5.477      0.000       0.726       1.576
log(CanHt)       0.3732      0.281      1.330      0.191      -0.194       0.941
log(TotHt)       0.3943      0.313      1.260      0.215      -0.239       1.027
log(Dens)        0.1093      0.122      0.896      0.376      -0.137       0.356
==============================================================================
Omnibus:                        1.935   Durbin-Watson:                   1.362
Prob(Omnibus):                  0.380   Jarque-Bera (JB):                1.047
Skew:                          -0.283   Prob(JB):                        0.593
Kurtosis:                       3.474   Cond. No.                         12.6
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Residual SD: 0.31

Interpretation => a difference of \(x\%\) in canopy height corresponds to a difference of approximately \(0.37x\%\) in leaf weight, holding the other predictors constant.
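For example, a sketch using the canopy-height coefficient from the summary above:

b_canht = 0.3732  # log(CanHt) coefficient

# A 10% taller canopy multiplies predicted leaf weight by 1.10 ** b_canht
print(1.10 ** b_canht)  # ~1.036, i.e. about 3.7% more leaf weight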

The above model is complicated; perhaps we can start with something simpler.

mesquite_df["CanVol"] = mesquite_df['Diam1'] * mesquite_df['Diam2'] * mesquite_df['CanHt'] 
formula = """log(LeafWt) ~ log(CanVol)"""
mesquite_model = smf.ols(formula=formula, data=mesquite_df).fit()
print(mesquite_model.summary())
print("Residual SD: {}".format(np.round(mesquite_model.resid.std(), 2)))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            log(LeafWt)   R-squared:                       0.799
Model:                            OLS   Adj. R-squared:                  0.795
Method:                 Least Squares   F-statistic:                     175.1
Date:                Sat, 20 Jun 2020   Prob (F-statistic):           6.09e-17
Time:                        23:35:24   Log-Likelihood:                -23.687
No. Observations:                  46   AIC:                             51.37
Df Residuals:                      44   BIC:                             55.03
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       5.1697      0.083     62.066      0.000       5.002       5.338
log(CanVol)     0.7224      0.055     13.234      0.000       0.612       0.832
==============================================================================
Omnibus:                        2.765   Durbin-Watson:                   0.996
Prob(Omnibus):                  0.251   Jarque-Bera (JB):                2.066
Skew:                          -0.515   Prob(JB):                        0.356
Kurtosis:                       3.125   Cond. No.                         2.59
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Residual SD: 0.41

So leaf weight is approximately proportional to canopy volume raised to the power 0.72, and the model has an impressive \(R^2\) of 0.8.
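For instance, doubling canopy volume multiplies predicted leaf weight by \(2^{0.72} \approx 1.65\):

b_canvol = 0.7224  # log(CanVol) coefficient from the fit above

print(2 ** b_canvol)  # ~1.65: doubling volume gives ~65% more leaf weight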

mesquite_df["CanVol"] = mesquite_df['Diam1'] * mesquite_df['Diam2'] * mesquite_df['CanHt'] 
mesquite_df["CanArea"] = mesquite_df['Diam1'] * mesquite_df['Diam2']
mesquite_df["Shape"] = mesquite_df['Diam1'] / mesquite_df['Diam2']

formula = """log(LeafWt) ~ log(CanVol) + log(CanArea) + log(Shape) + log(TotHt) + log(Dens) + Group"""

mesquite_model = smf.ols(formula=formula, data=mesquite_df).fit()
print(mesquite_model.summary())
print("Residual SD: {}".format(np.round(mesquite_model.resid.std(), 2)))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            log(LeafWt)   R-squared:                       0.887
Model:                            OLS   Adj. R-squared:                  0.870
Method:                 Least Squares   F-statistic:                     51.17
Date:                Sat, 20 Jun 2020   Prob (F-statistic):           5.73e-17
Time:                        23:35:24   Log-Likelihood:                -10.406
No. Observations:                  46   AIC:                             34.81
Df Residuals:                      39   BIC:                             47.61
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        4.7680      0.155     30.747      0.000       4.454       5.082
Group[T.MCD]     0.5834      0.129      4.534      0.000       0.323       0.844
log(CanVol)      0.3732      0.281      1.330      0.191      -0.194       0.941
log(CanArea)     0.3993      0.294      1.357      0.183      -0.196       0.994
log(Shape)      -0.3787      0.231     -1.642      0.109      -0.845       0.088
log(TotHt)       0.3943      0.313      1.260      0.215      -0.239       1.027
log(Dens)        0.1093      0.122      0.896      0.376      -0.137       0.356
==============================================================================
Omnibus:                        1.935   Durbin-Watson:                   1.362
Prob(Omnibus):                  0.380   Jarque-Bera (JB):                1.047
Skew:                          -0.283   Prob(JB):                        0.593
Kurtosis:                       3.474   Cond. No.                         22.2
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Residual SD: 0.31
formula = """log(LeafWt) ~ log(CanVol) + log(CanArea) + Group"""

mesquite_model = smf.ols(formula=formula, data=mesquite_df).fit()
print(mesquite_model.summary())
print("Residual SD: {}".format(np.round(mesquite_model.resid.std(), 2)))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            log(LeafWt)   R-squared:                       0.873
Model:                            OLS   Adj. R-squared:                  0.864
Method:                 Least Squares   F-statistic:                     96.01
Date:                Sat, 20 Jun 2020   Prob (F-statistic):           7.79e-19
Time:                        23:35:24   Log-Likelihood:                -13.198
No. Observations:                  46   AIC:                             34.40
Df Residuals:                      42   BIC:                             41.71
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        4.6975      0.118     39.812      0.000       4.459       4.936
Group[T.MCD]     0.5269      0.116      4.558      0.000       0.294       0.760
log(CanVol)      0.6148      0.191      3.215      0.003       0.229       1.001
log(CanArea)     0.2889      0.238      1.215      0.231      -0.191       0.768
==============================================================================
Omnibus:                        1.398   Durbin-Watson:                   1.282
Prob(Omnibus):                  0.497   Jarque-Bera (JB):                0.694
Skew:                          -0.261   Prob(JB):                        0.707
Kurtosis:                       3.300   Cond. No.                         13.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Residual SD: 0.33