Chapter 4 Linear regression: before and after fitting the model¶
import warnings
import pandas as pd
import proplot as plot
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats
warnings.filterwarnings("ignore")
%pylab inline
plt.rcParams["axes.labelweight"] = "bold"
plt.rcParams["font.weight"] = "bold"
Populating the interactive namespace from numpy and matplotlib
earnings_df = pd.read_csv("../data/earnings/heights.tsv.gz", sep="\t")
earnings_df.head()
| | earn | height1 | height2 | sex | race | hisp | ed | yearbn | height |
|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 5.0 | 6.0 | 2 | 1 | 2 | 12 | 53 | 66.0 |
| 1 | NaN | 5.0 | 4.0 | 1 | 2 | 2 | 12 | 50 | 64.0 |
| 2 | 50000.0 | 6.0 | 2.0 | 1 | 1 | 2 | 16 | 45 | 74.0 |
| 3 | 60000.0 | 5.0 | 6.0 | 2 | 1 | 2 | 16 | 32 | 66.0 |
| 4 | 30000.0 | 5.0 | 4.0 | 2 | 1 | 2 | 16 | 61 | 64.0 |
earnings_df.shape
(2029, 9)
earnings_df.dropna().shape
(1379, 9)
earnings_df = earnings_df.dropna()
earnings_df.shape
(1379, 9)
earnings_df.head()
| | earn | height1 | height2 | sex | race | hisp | ed | yearbn | height |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 50000.0 | 6.0 | 2.0 | 1 | 1 | 2 | 16 | 45 | 74.0 |
| 3 | 60000.0 | 5.0 | 6.0 | 2 | 1 | 2 | 16 | 32 | 66.0 |
| 4 | 30000.0 | 5.0 | 4.0 | 2 | 1 | 2 | 16 | 61 | 64.0 |
| 6 | 50000.0 | 5.0 | 3.0 | 2 | 3 | 2 | 16 | 99 | 63.0 |
| 8 | 51000.0 | 5.0 | 3.0 | 2 | 1 | 2 | 17 | 51 | 63.0 |
Centering and standardizing, especially for models with interactions¶
kidiq_df = pd.read_csv("../data/kidiq.tsv.gz", sep="\t")
kidiq_df.head()
| | kid_score | mom_hs | mom_iq | mom_work | mom_age |
|---|---|---|---|---|---|
| 0 | 65 | 1.0 | 121.117529 | 4 | 27 |
| 1 | 98 | 1.0 | 89.361882 | 4 | 25 |
| 2 | 85 | 1.0 | 115.443165 | 4 | 27 |
| 3 | 83 | 1.0 | 99.449639 | 3 | 25 |
| 4 | 115 | 1.0 | 92.745710 | 4 | 27 |
model = smf.ols(
formula="""kid_score ~ mom_hs + mom_iq + mom_hs:mom_iq""", data=kidiq_df
)
results_hs = model.fit()
print(results_hs.summary())
print("Residual SD: {}".format(np.round(results_hs.resid.std(), 2)))
OLS Regression Results
==============================================================================
Dep. Variable: kid_score R-squared: 0.230
Model: OLS Adj. R-squared: 0.225
Method: Least Squares F-statistic: 42.84
Date: Sat, 20 Jun 2020 Prob (F-statistic): 3.07e-24
Time: 23:35:23 Log-Likelihood: -1867.5
No. Observations: 434 AIC: 3743.
Df Residuals: 430 BIC: 3759.
Df Model: 3
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept -11.4820 13.758 -0.835 0.404 -38.523 15.559
mom_hs 51.2682 15.338 3.343 0.001 21.122 81.414
mom_iq 0.9689 0.148 6.531 0.000 0.677 1.260
mom_hs:mom_iq -0.4843 0.162 -2.985 0.003 -0.803 -0.165
==============================================================================
Omnibus: 8.014 Durbin-Watson: 1.660
Prob(Omnibus): 0.018 Jarque-Bera (JB): 8.258
Skew: -0.333 Prob(JB): 0.0161
Kurtosis: 2.887 Cond. No. 3.10e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.1e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Residual SD: 17.91
Centering by subtracting the mean¶
kidiq_df["mom_hs_centered"] = kidiq_df["mom_hs"] - kidiq_df["mom_hs"].mean()
kidiq_df["mom_iq_centered"] = kidiq_df["mom_iq"] - kidiq_df["mom_iq"].mean()
model = smf.ols(
formula="""kid_score ~ mom_hs_centered + mom_iq_centered + mom_hs_centered:mom_iq_centered""",
data=kidiq_df,
)
results_hs = model.fit()
print(results_hs.summary())
print("Residual SD: {}".format(np.round(results_hs.resid.std(), 2)))
OLS Regression Results
==============================================================================
Dep. Variable: kid_score R-squared: 0.230
Model: OLS Adj. R-squared: 0.225
Method: Least Squares F-statistic: 42.84
Date: Sat, 20 Jun 2020 Prob (F-statistic): 3.07e-24
Time: 23:35:23 Log-Likelihood: -1867.5
No. Observations: 434 AIC: 3743.
Df Residuals: 430 BIC: 3759.
Df Model: 3
Covariance Type: nonrobust
===================================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------------
Intercept 87.6389 0.908 96.565 0.000 85.855 89.423
mom_hs_centered 2.8408 2.427 1.171 0.242 -1.929 7.610
mom_iq_centered 0.5884 0.061 9.712 0.000 0.469 0.707
mom_hs_centered:mom_iq_centered -0.4843 0.162 -2.985 0.003 -0.803 -0.165
==============================================================================
Omnibus: 8.014 Durbin-Watson: 1.660
Prob(Omnibus): 0.018 Jarque-Bera (JB): 8.258
Skew: -0.333 Prob(JB): 0.0161
Kurtosis: 2.887 Cond. No. 42.2
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Residual SD: 17.91
The residual SD and \(R^2\) do not change, as a linear transformation of the inputs does not affect the fit. Interpretation:
- Intercept (87.6): predicted score for children whose mothers have average IQ and the average rate of high-school completion.
- Coeff of mom_hs (2.8): predicted difference in score comparing children whose mothers did and did not complete high school, among mothers with average IQ.
- Coeff of mom_iq (0.6): predicted difference in score per 1-point difference in mother's IQ, at the average level of high-school completion.
- Interaction effect (-0.48): difference in the mom_iq slope comparing children whose mothers did and did not complete high school.
After centering, the coefficient of each main effect corresponds to the predictive difference with the other input held at its average value.
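As a quick check (a minimal sketch reusing the centered columns created above), the two parameterizations span the same column space and therefore give identical fitted values:
# refit both parameterizations and compare predictions
raw = smf.ols("kid_score ~ mom_hs + mom_iq + mom_hs:mom_iq", data=kidiq_df).fit()
cen = smf.ols("kid_score ~ mom_hs_centered + mom_iq_centered + mom_hs_centered:mom_iq_centered", data=kidiq_df).fit()
print(np.allclose(raw.fittedvalues, cen.fittedvalues))  # True: same fit, different parameterization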
Using a conventional centering point¶
If we instead center on understandable reference points (0.5 for mom_hs and 100 for mom_iq), the coefficients become more interpretable.
kidiq_df["mom_hs_centered2"] = kidiq_df["mom_hs"] - 0.5
kidiq_df["mom_iq_centered2"] = kidiq_df["mom_iq"] - 100
model = smf.ols(
formula="""kid_score ~ mom_hs_centered2 + mom_iq_centered2 + mom_hs_centered2:mom_iq_centered2""",
data=kidiq_df,
)
results_hs = model.fit()
print(results_hs.summary())
print("Residual SD: {}".format(np.round(results_hs.resid.std(), 2)))
OLS Regression Results
==============================================================================
Dep. Variable: kid_score R-squared: 0.230
Model: OLS Adj. R-squared: 0.225
Method: Least Squares F-statistic: 42.84
Date: Sat, 20 Jun 2020 Prob (F-statistic): 3.07e-24
Time: 23:35:23 Log-Likelihood: -1867.5
No. Observations: 434 AIC: 3743.
Df Residuals: 430 BIC: 3759.
Df Model: 3
Covariance Type: nonrobust
=====================================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------------
Intercept 86.8273 1.213 71.561 0.000 84.442 89.212
mom_hs_centered2 2.8408 2.427 1.171 0.242 -1.929 7.610
mom_iq_centered2 0.7268 0.081 8.960 0.000 0.567 0.886
mom_hs_centered2:mom_iq_centered2 -0.4843 0.162 -2.985 0.003 -0.803 -0.165
==============================================================================
Omnibus: 8.014 Durbin-Watson: 1.660
Prob(Omnibus): 0.018 Jarque-Bera (JB): 8.258
Skew: -0.333 Prob(JB): 0.0161
Kurtosis: 2.887 Cond. No. 46.9
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Residual SD: 17.91
The coefficient of mom_hs_centered2 is the average predictive difference between a child with mom_hs = 1 and mom_hs = 0, among children with mom_iq = 100.
The coefficient of mom_iq_centered2 is the average predictive difference per 1-point difference in mother's IQ, among children with mom_hs = 0.5 (that is, averaging over high-school status).
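A minimal sketch verifying the mom_hs_centered2 reading via predictions at mom_iq = 100 (i.e. mom_iq_centered2 = 0):
# predict for mom_hs = 1 (centered: 0.5) and mom_hs = 0 (centered: -0.5) at mom_iq = 100
new = pd.DataFrame({"mom_hs_centered2": [0.5, -0.5], "mom_iq_centered2": [0, 0]})
pred = results_hs.predict(new)
print(pred.iloc[0] - pred.iloc[1])  # ≈ 2.84, the mom_hs_centered2 coefficient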
Standardizing by subtracting the mean and dividing by 2 standard deviations¶
kidiq_df["mom_hs_centeredZ"] = (
(kidiq_df["mom_hs"] - kidiq_df["mom_hs"].mean()) / 2 * kidiq_df["mom_hs"].std()
)
kidiq_df["mom_iq_centeredZ"] = (
(kidiq_df["mom_iq"] - kidiq_df["mom_iq"].mean()) / 2 * kidiq_df["mom_iq"].std()
)
model = smf.ols(
formula="""kid_score ~ mom_hs_centeredZ + mom_iq_centeredZ + mom_hs_centeredZ:mom_iq_centeredZ""",
data=kidiq_df,
)
results_hs = model.fit()
print(results_hs.summary())
print("Residual SD: {}".format(np.round(results_hs.resid.std(), 2)))
OLS Regression Results
==============================================================================
Dep. Variable: kid_score R-squared: 0.230
Model: OLS Adj. R-squared: 0.225
Method: Least Squares F-statistic: 42.84
Date: Sat, 20 Jun 2020 Prob (F-statistic): 3.07e-24
Time: 23:35:23 Log-Likelihood: -1867.5
No. Observations: 434 AIC: 3743.
Df Residuals: 430 BIC: 3759.
Df Model: 3
Covariance Type: nonrobust
=====================================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------------
Intercept 87.6389 0.908 96.565 0.000 85.855 89.423
mom_hs_centeredZ 13.8304 11.814 1.171 0.242 -9.391 37.051
mom_iq_centeredZ 0.0785 0.008 9.712 0.000 0.063 0.094
mom_hs_centeredZ:mom_iq_centeredZ -0.3144 0.105 -2.985 0.003 -0.521 -0.107
==============================================================================
Omnibus: 8.014 Durbin-Watson: 1.660
Prob(Omnibus): 0.018 Jarque-Bera (JB): 8.258
Skew: -0.333 Prob(JB): 0.0161
Kurtosis: 2.887 Cond. No. 1.54e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.54e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Residual SD: 17.91
Why scale by 2 standard deviations?¶
If \(x\) is a binary variable with equal proportions, its standard deviation is \(\sqrt{0.5 \times 0.5} = 0.5\). If we divide by only one standard deviation, the two values of \(x\) map to \((1 - 0.5)/0.5 = 1\) and \((0 - 0.5)/0.5 = -1\), a difference of 2, so the coefficient corresponds to only half the difference between the two possible values of \(x\). Dividing by 2 standard deviations makes that difference 1, so the coefficient is directly comparable to the coefficient of an untransformed binary predictor.
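The arithmetic, as a minimal sketch:
x = np.array([0, 1] * 50)                        # balanced binary input, mean 0.5
print(x.std())                                   # 0.5
z1 = (x - x.mean()) / x.std()                    # one-SD scaling: values -1 and +1
z2 = (x - x.mean()) / (2 * x.std())              # two-SD scaling: values -0.5 and +0.5
print(z1.max() - z1.min(), z2.max() - z2.min())  # 2.0 1.0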
Logarithmic transformations¶
A linear model on the logarithmic scale corresponds to a multiplicative model on the original scale.
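Concretely, if \(\log y = a + bx + \epsilon\), exponentiating both sides gives \(y = e^{a} \, e^{bx} \, e^{\epsilon}\): each one-unit difference in \(x\) multiplies \(y\) by the factor \(e^{b}\).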
Height and earnings example¶
earnings_df["log_earn"] = np.log(earnings_df["earn"])
earnings_df = earnings_df.replace([np.inf, -np.inf], np.nan)
earnings_df = earnings_df.dropna()
earn_log_model = smf.ols(formula="""log_earn ~ height""", data=earnings_df).fit()
print(earn_log_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: log_earn R-squared: 0.060
Model: OLS Adj. R-squared: 0.060
Method: Least Squares F-statistic: 76.44
Date: Sat, 20 Jun 2020 Prob (F-statistic): 7.62e-18
Time: 23:35:23 Log-Likelihood: -1555.6
No. Observations: 1192 AIC: 3115.
Df Residuals: 1190 BIC: 3125.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 5.7785 0.451 12.815 0.000 4.894 6.663
height 0.0588 0.007 8.743 0.000 0.046 0.072
==============================================================================
Omnibus: 223.552 Durbin-Watson: 1.904
Prob(Omnibus): 0.000 Jarque-Bera (JB): 457.878
Skew: -1.079 Prob(JB): 3.74e-100
Kurtosis: 5.136 Cond. No. 1.17e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.17e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Interpretation¶
The coefficient of 0.059 on height implies that a 1-inch difference in height corresponds to multiplying earnings by \(\exp(0.059) \approx 1.06\). Thus a 1-inch difference in height predicts a difference of about 6% in earnings.
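A quick numerical check of this reading (a sketch; earn_log_model here is the natural-log fit above):
b = earn_log_model.params["height"]
print(np.exp(b), 1 + b)  # ≈ 1.061 vs 1.059: exp(b) ≈ 1 + b for small coefficients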
Why use a natural log instead of \(\log_{10}\)¶
Coefficients on the natural-log scale can be read directly as approximate proportional differences: a coefficient of 0.06 corresponds to a difference of about 6%, since \(\exp(b) \approx 1 + b\) for small \(b\). A change on the \(\log_{10}\) scale is harder to read off:
earnings_df["log10_earn"] = np.log10(earnings_df["earn"])
earnings_df = earnings_df.replace([np.inf, -np.inf], np.nan)
earnings_df = earnings_df.dropna()
earn_log_model = smf.ols(formula="""log10_earn ~ height""", data=earnings_df).fit()
print(earn_log_model.summary())
print("Residual SD: {}".format(np.round(earn_log_model.resid.std(), 2)))
OLS Regression Results
==============================================================================
Dep. Variable: log10_earn R-squared: 0.060
Model: OLS Adj. R-squared: 0.060
Method: Least Squares F-statistic: 76.44
Date: Sat, 20 Jun 2020 Prob (F-statistic): 7.62e-18
Time: 23:35:23 Log-Likelihood: -561.42
No. Observations: 1192 AIC: 1127.
Df Residuals: 1190 BIC: 1137.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 2.5096 0.196 12.815 0.000 2.125 2.894
height 0.0255 0.003 8.743 0.000 0.020 0.031
==============================================================================
Omnibus: 223.552 Durbin-Watson: 1.904
Prob(Omnibus): 0.000 Jarque-Bera (JB): 457.878
Skew: -1.079 Prob(JB): 3.74e-100
Kurtosis: 5.136 Cond. No. 1.17e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.17e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Residual SD: 0.39
The coefficient of \(0.0255\) is less interpretable here: it corresponds to the same multiplicative change of \(10^{0.0255} \approx 1.06\), but the 6% reading can no longer be taken directly from the coefficient.
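The two fits are equivalent up to a change of base: multiplying the \(\log_{10}\) coefficient by \(\ln 10\) recovers the natural-log coefficient (a minimal sketch):
print(0.0255 * np.log(10))  # ≈ 0.0587, matching the natural-log height coefficient of 0.0588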
Building a regression model on the log scale¶
earnings_df["male"] = 1
# Set rows where sex==2 to female (0)
earnings_df.loc[earnings_df.sex == 2, "male"] = 0
earn_log_model = smf.ols(formula="""log_earn ~ height + male""", data=earnings_df).fit()
print(earn_log_model.summary())
print("Residual SD: {}".format(np.round(earn_log_model.resid.std(), 2)))
OLS Regression Results
==============================================================================
Dep. Variable: log_earn R-squared: 0.087
Model: OLS Adj. R-squared: 0.085
Method: Least Squares F-statistic: 56.34
Date: Sat, 20 Jun 2020 Prob (F-statistic): 4.21e-24
Time: 23:35:23 Log-Likelihood: -1538.7
No. Observations: 1192 AIC: 3083.
Df Residuals: 1189 BIC: 3099.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 8.1527 0.603 13.530 0.000 6.970 9.335
height 0.0207 0.009 2.218 0.027 0.002 0.039
male 0.4232 0.072 5.841 0.000 0.281 0.565
==============================================================================
Omnibus: 225.577 Durbin-Watson: 1.889
Prob(Omnibus): 0.000 Jarque-Bera (JB): 462.608
Skew: -1.088 Prob(JB): 3.52e-101
Kurtosis: 5.141 Cond. No. 1.59e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.59e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Residual SD: 0.88
Including an interaction¶
earnings_df["male"] = 1
# Set rows where sex==2 to female (0)
earnings_df.loc[earnings_df.sex == 2, "male"] = 0
earn_log_model3 = smf.ols(
formula="""log_earn ~ height + male + height:male""", data=earnings_df
).fit()
print(earn_log_model3.summary())
print("Residual SD: {}".format(np.round(earn_log_model3.resid.std(), 2)))
OLS Regression Results
==============================================================================
Dep. Variable: log_earn R-squared: 0.087
Model: OLS Adj. R-squared: 0.084
Method: Least Squares F-statistic: 37.58
Date: Sat, 20 Jun 2020 Prob (F-statistic): 3.32e-23
Time: 23:35:23 Log-Likelihood: -1538.6
No. Observations: 1192 AIC: 3085.
Df Residuals: 1188 BIC: 3106.
Df Model: 3
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept 8.3885 0.844 9.945 0.000 6.734 10.043
height 0.0170 0.013 1.304 0.193 -0.009 0.043
male -0.0786 1.258 -0.062 0.950 -2.546 2.389
height:male 0.0074 0.019 0.400 0.690 -0.029 0.044
==============================================================================
Omnibus: 225.961 Durbin-Watson: 1.889
Prob(Omnibus): 0.000 Jarque-Bera (JB): 463.655
Skew: -1.089 Prob(JB): 2.08e-101
Kurtosis: 5.143 Cond. No. 4.16e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.16e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Residual SD: 0.88
Interpretation:
- Intercept = predicted log earnings when height and male are both zero; heights are never zero, so this is not interpretable on its own.
- Coeff of height (0.017) = predicted difference in log earnings per 1-inch difference in height when male = 0 => the estimated predictive difference in earnings is 1.7% per inch of height among women.
- Coeff of male (-0.079) = predicted difference in log earnings between men and women if height equals 0. Again, heights are never zero, so this is not interpretable.
- Interaction term height:male (0.0074) = difference in the height slope comparing men to women => an inch of height corresponds to a 0.7% larger difference in earnings among men than among women, so the estimated predictive difference is 1.7% + 0.7% = 2.4% per inch of height for men.
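The two slopes can be read off the fitted parameters directly (a sketch using the params Series of the model above):
slope_women = earn_log_model3.params["height"]
slope_men = slope_women + earn_log_model3.params["height:male"]
print(slope_women, slope_men)  # ≈ 0.017 (women) and ≈ 0.024 (men) per inch of height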
Linear transformation to make coefficients more interpretable:¶
earnings_df["height_z"] = (
earnings_df["height"] - earnings_df["height"].mean()
) / earnings_df["height"].std()
earn_log_model4 = smf.ols(
formula="""log_earn ~ height_z + male + height_z:male""", data=earnings_df
).fit()
print(earn_log_model4.summary())
print("Residual SD: {}".format(np.round(earn_log_model4.resid.std(), 2)))
print(earnings_df["height"].mean(), earnings_df["height"].std())
OLS Regression Results
==============================================================================
Dep. Variable: log_earn R-squared: 0.087
Model: OLS Adj. R-squared: 0.084
Method: Least Squares F-statistic: 37.58
Date: Sat, 20 Jun 2020 Prob (F-statistic): 3.32e-23
Time: 23:35:23 Log-Likelihood: -1538.6
No. Observations: 1192 AIC: 3085.
Df Residuals: 1188 BIC: 3106.
Df Model: 3
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 9.5266 0.045 210.878 0.000 9.438 9.615
height_z 0.0654 0.050 1.304 0.193 -0.033 0.164
male 0.4197 0.073 5.748 0.000 0.276 0.563
height_z:male 0.0286 0.072 0.400 0.690 -0.112 0.169
==============================================================================
Omnibus: 225.961 Durbin-Watson: 1.889
Prob(Omnibus): 0.000 Jarque-Bera (JB): 463.655
Skew: -1.089 Prob(JB): 2.08e-101
Kurtosis: 5.143 Cond. No. 4.64
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Residual SD: 0.88
66.91694630872483 3.846637848956869
Interpretation:
- Intercept (9.53) = predicted log earnings for a woman of average height (66.9 inches); with height centered, this is now directly interpretable.
- Coeff of height_z (0.065) = predicted difference in log earnings per standard deviation of height (3.8 inches) if male = 0 => a 3.8-inch difference in height among women corresponds to an estimated 6.5% difference in earnings.
- Coeff of male (0.42) = predicted difference in log earnings between men and women of average height => men of average height are predicted to earn about \(\exp(0.42) \approx 1.52\) times as much as women of average height.
- Interaction term height_z:male (0.029) = difference in the height_z slope comparing men to women => among men the estimated slope is 0.065 + 0.029 = 0.094 in log earnings per standard deviation of height, versus 0.065 among women.
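To move coefficients back to the dollar scale, exponentiate (a sketch; the exponentiated intercept is a geometric-mean-type prediction, ignoring retransformation bias):
print(np.exp(earn_log_model4.params["Intercept"]))  # ≈ 13,700: predicted earnings for a woman of average height
print(np.exp(earn_log_model4.params["male"]))       # ≈ 1.52: multiplicative difference for men at average height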
Using discrete rather than continuous predictors¶
kid_score_momwork = smf.ols(formula="""kid_score ~ C(mom_work)""", data=kidiq_df).fit()
print(kid_score_momwork.summary())
print("Residual SD: {}".format(np.round(kid_score_momwork.resid.std(), 2)))
OLS Regression Results
==============================================================================
Dep. Variable: kid_score R-squared: 0.024
Model: OLS Adj. R-squared: 0.018
Method: Least Squares F-statistic: 3.590
Date: Sat, 20 Jun 2020 Prob (F-statistic): 0.0138
Time: 23:35:24 Log-Likelihood: -1918.9
No. Observations: 434 AIC: 3846.
Df Residuals: 430 BIC: 3862.
Df Model: 3
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
Intercept 82.0000 2.305 35.568 0.000 77.469 86.531
C(mom_work)[T.2] 3.8542 3.095 1.245 0.214 -2.229 9.937
C(mom_work)[T.3] 11.5000 3.553 3.237 0.001 4.517 18.483
C(mom_work)[T.4] 5.2098 2.704 1.927 0.055 -0.105 10.524
==============================================================================
Omnibus: 12.371 Durbin-Watson: 1.477
Prob(Omnibus): 0.002 Jarque-Bera (JB): 12.950
Skew: -0.411 Prob(JB): 0.00154
Kurtosis: 2.796 Cond. No. 5.92
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Residual SD: 20.16
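With a single categorical predictor, the intercept is the mean kid_score for the reference level (mom_work = 1) and each C(mom_work)[T.k] coefficient is that group's difference in means from the reference; a quick sketch to verify:
means = kidiq_df.groupby("mom_work")["kid_score"].mean()
print(means)                  # group means; mean for mom_work = 1 equals the intercept (82.0)
print(means - means.loc[1])   # reproduces the C(mom_work)[T.k] coefficients (0 for the reference)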
Building regression models for prediction¶
mesquite_df = pd.read_csv("../data/mesquite/mesquite.tsv.gz", sep="\t")
mesquite_df.head()
| | Obs | Group | Diam1 | Diam2 | TotHt | CanHt | Dens | LeafWt |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | MCD | 1.8 | 1.15 | 1.30 | 1.00 | 1 | 401.3 |
| 1 | 2 | MCD | 1.7 | 1.35 | 1.35 | 1.33 | 1 | 513.7 |
| 2 | 3 | MCD | 2.8 | 2.55 | 2.16 | 0.60 | 1 | 1179.2 |
| 3 | 4 | MCD | 1.3 | 0.85 | 1.80 | 1.20 | 1 | 308.0 |
| 4 | 5 | MCD | 3.3 | 1.90 | 1.55 | 1.05 | 1 | 855.2 |
formula = """LeafWt ~ Diam1 + Diam2 + CanHt + TotHt + Dens + Group"""
mesquite_model = smf.ols(formula=formula, data=mesquite_df).fit()
print(mesquite_model.summary())
print("Residual SD: {}".format(np.round(mesquite_model.resid.std(), 2)))
OLS Regression Results
==============================================================================
Dep. Variable: LeafWt R-squared: 0.848
Model: OLS Adj. R-squared: 0.825
Method: Least Squares F-statistic: 36.34
Date: Sat, 20 Jun 2020 Prob (F-statistic): 1.73e-14
Time: 23:35:24 Log-Likelihood: -318.82
No. Observations: 46 AIC: 651.6
Df Residuals: 39 BIC: 664.4
Df Model: 6
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept -1091.8880 176.456 -6.188 0.000 -1448.804 -734.972
Group[T.MCD] 363.2951 100.184 3.626 0.001 160.654 565.936
Diam1 189.6690 112.760 1.682 0.101 -38.410 417.748
Diam2 371.4621 124.378 2.987 0.005 119.883 623.041
CanHt 355.6653 209.843 1.695 0.098 -68.782 780.113
TotHt -101.7325 185.574 -0.548 0.587 -477.091 273.626
Dens 131.2542 34.355 3.820 0.000 61.764 200.744
==============================================================================
Omnibus: 9.865 Durbin-Watson: 2.137
Prob(Omnibus): 0.007 Jarque-Bera (JB): 19.723
Skew: 0.366 Prob(JB): 5.21e-05
Kurtosis: 6.123 Cond. No. 26.9
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Residual SD: 250.39
Multiplicative Model¶
formula = """log(LeafWt) ~ log(Diam1) + log(Diam2) + log(CanHt) + log(TotHt) + log(Dens) + Group"""
mesquite_model = smf.ols(formula=formula, data=mesquite_df).fit()
print(mesquite_model.summary())
print("Residual SD: {}".format(np.round(mesquite_model.resid.std(), 2)))
OLS Regression Results
==============================================================================
Dep. Variable: log(LeafWt) R-squared: 0.887
Model: OLS Adj. R-squared: 0.870
Method: Least Squares F-statistic: 51.17
Date: Sat, 20 Jun 2020 Prob (F-statistic): 5.73e-17
Time: 23:35:24 Log-Likelihood: -10.406
No. Observations: 46 AIC: 34.81
Df Residuals: 39 BIC: 47.61
Df Model: 6
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 4.7680 0.155 30.747 0.000 4.454 5.082
Group[T.MCD] 0.5834 0.129 4.534 0.000 0.323 0.844
log(Diam1) 0.3938 0.282 1.397 0.170 -0.177 0.964
log(Diam2) 1.1512 0.210 5.477 0.000 0.726 1.576
log(CanHt) 0.3732 0.281 1.330 0.191 -0.194 0.941
log(TotHt) 0.3943 0.313 1.260 0.215 -0.239 1.027
log(Dens) 0.1093 0.122 0.896 0.376 -0.137 0.356
==============================================================================
Omnibus: 1.935 Durbin-Watson: 1.362
Prob(Omnibus): 0.380 Jarque-Bera (JB): 1.047
Skew: -0.283 Prob(JB): 0.593
Kurtosis: 3.474 Cond. No. 12.6
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Residual SD: 0.31
Interpretation => because both sides of the model are on the log scale, the coefficients are elasticities: a difference of x% in canopy height corresponds to an estimated difference of about 0.37x% in leaf weight.
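A sketch of the elasticity arithmetic for the CanHt coefficient (0.373):
print(1.01 ** 0.3732)  # ≈ 1.0037: a 1% taller canopy predicts roughly 0.37% more leaf weight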
The above model is a complicated one. Maybe we can start with something very simple.
mesquite_df["CanVol"] = mesquite_df['Diam1'] * mesquite_df['Diam2'] * mesquite_df['CanHt']
formula = """log(LeafWt) ~ log(CanVol)"""
mesquite_model = smf.ols(formula=formula, data=mesquite_df).fit()
print(mesquite_model.summary())
print("Residual SD: {}".format(np.round(mesquite_model.resid.std(), 2)))
OLS Regression Results
==============================================================================
Dep. Variable: log(LeafWt) R-squared: 0.799
Model: OLS Adj. R-squared: 0.795
Method: Least Squares F-statistic: 175.1
Date: Sat, 20 Jun 2020 Prob (F-statistic): 6.09e-17
Time: 23:35:24 Log-Likelihood: -23.687
No. Observations: 46 AIC: 51.37
Df Residuals: 44 BIC: 55.03
Df Model: 1
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept 5.1697 0.083 62.066 0.000 5.002 5.338
log(CanVol) 0.7224 0.055 13.234 0.000 0.612 0.832
==============================================================================
Omnibus: 2.765 Durbin-Watson: 0.996
Prob(Omnibus): 0.251 Jarque-Bera (JB): 2.066
Skew: -0.515 Prob(JB): 0.356
Kurtosis: 3.125 Cond. No. 2.59
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Residual SD: 0.41
So leaf weight is approximately proportional to the 0.72 power of canopy volume, with an impressive \(R^2\) of 0.80.
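Undoing the logs expresses this fit as a power law (a sketch; this ignores retransformation bias in the error term):
a = mesquite_model.params["Intercept"]
b = mesquite_model.params["log(CanVol)"]
print(np.exp(a), b)  # LeafWt ≈ 176 * CanVol**0.72 on the original scale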
mesquite_df["CanVol"] = mesquite_df['Diam1'] * mesquite_df['Diam2'] * mesquite_df['CanHt']
mesquite_df["CanArea"] = mesquite_df['Diam1'] * mesquite_df['Diam2']
mesquite_df["Shape"] = mesquite_df['Diam1'] / mesquite_df['Diam2']
formula = """log(LeafWt) ~ log(CanVol) + log(CanArea) + log(Shape) + log(TotHt) + log(Dens) + Group"""
mesquite_model = smf.ols(formula=formula, data=mesquite_df).fit()
print(mesquite_model.summary())
print("Residual SD: {}".format(np.round(mesquite_model.resid.std(), 2)))
OLS Regression Results
==============================================================================
Dep. Variable: log(LeafWt) R-squared: 0.887
Model: OLS Adj. R-squared: 0.870
Method: Least Squares F-statistic: 51.17
Date: Sat, 20 Jun 2020 Prob (F-statistic): 5.73e-17
Time: 23:35:24 Log-Likelihood: -10.406
No. Observations: 46 AIC: 34.81
Df Residuals: 39 BIC: 47.61
Df Model: 6
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 4.7680 0.155 30.747 0.000 4.454 5.082
Group[T.MCD] 0.5834 0.129 4.534 0.000 0.323 0.844
log(CanVol) 0.3732 0.281 1.330 0.191 -0.194 0.941
log(CanArea) 0.3993 0.294 1.357 0.183 -0.196 0.994
log(Shape) -0.3787 0.231 -1.642 0.109 -0.845 0.088
log(TotHt) 0.3943 0.313 1.260 0.215 -0.239 1.027
log(Dens) 0.1093 0.122 0.896 0.376 -0.137 0.356
==============================================================================
Omnibus: 1.935 Durbin-Watson: 1.362
Prob(Omnibus): 0.380 Jarque-Bera (JB): 1.047
Skew: -0.283 Prob(JB): 0.593
Kurtosis: 3.474 Cond. No. 22.2
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Residual SD: 0.31
formula = """log(LeafWt) ~ log(CanVol) + log(CanArea) + Group"""
mesquite_model = smf.ols(formula=formula, data=mesquite_df).fit()
print(mesquite_model.summary())
print("Residual SD: {}".format(np.round(mesquite_model.resid.std(), 2)))
OLS Regression Results
==============================================================================
Dep. Variable: log(LeafWt) R-squared: 0.873
Model: OLS Adj. R-squared: 0.864
Method: Least Squares F-statistic: 96.01
Date: Sat, 20 Jun 2020 Prob (F-statistic): 7.79e-19
Time: 23:35:24 Log-Likelihood: -13.198
No. Observations: 46 AIC: 34.40
Df Residuals: 42 BIC: 41.71
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 4.6975 0.118 39.812 0.000 4.459 4.936
Group[T.MCD] 0.5269 0.116 4.558 0.000 0.294 0.760
log(CanVol) 0.6148 0.191 3.215 0.003 0.229 1.001
log(CanArea) 0.2889 0.238 1.215 0.231 -0.191 0.768
==============================================================================
Omnibus: 1.398 Durbin-Watson: 1.282
Prob(Omnibus): 0.497 Jarque-Bera (JB): 0.694
Skew: -0.261 Prob(JB): 0.707
Kurtosis: 3.300 Cond. No. 13.3
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Residual SD: 0.33