library(ggplot2)
data <- read.csv('data_ch9_16.csv', header=T)
ggplot(data, aes(x=DurationOfVisit, y=PollenRemoved, color=BeeType)) +
geom_point(shape=1) +
scale_colour_hue(l=50) +
geom_smooth(method=lm,
se=FALSE)
From the graph we see that the relation between proportion removed and duration visited is not linear
data$PollenRemovedlogit = log(data$PollenRemoved/(1-data$PollenRemoved))
ggplot(data, aes(x=DurationOfVisit, y=PollenRemovedlogit, color=BeeType)) +
geom_point(shape=1) +
scale_colour_hue(l=50) +
geom_smooth(method=lm,
se=FALSE)
From the plot above it does NOT seem that the logit transformed PollenRemoved has a linear relationship with duration of visit
Model:
data$DurationOfVisitlog = log(data$DurationOfVisit)
ggplot(data, aes(x=DurationOfVisitlog, y=PollenRemovedlogit, color=BeeType)) +
geom_point(shape=1) +
scale_colour_hue(l=50) +
geom_smooth(method=lm,
se=FALSE)
From the plot of \(logit(PollenRemoved)\) vs \(log(DurationoOfVisit)\) it seems that these follow a linear relationship.
Model:
lmfit <- lm(PollenRemovedlogit ~ BeeType + DurationOfVisitlog
+ BeeType*DurationOfVisitlog, data=data)
summary(lmfit)
##
## Call:
## lm(formula = PollenRemovedlogit ~ BeeType + DurationOfVisitlog +
## BeeType * DurationOfVisitlog, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3803 -0.3699 0.0307 0.4552 1.1611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.0390 0.5115 -5.941 4.45e-07 ***
## BeeTypeWorker 1.3770 0.8722 1.579 0.122
## DurationOfVisitlog 1.0121 0.1902 5.321 3.52e-06 ***
## BeeTypeWorker:DurationOfVisitlog -0.2709 0.2817 -0.962 0.342
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6525 on 43 degrees of freedom
## Multiple R-squared: 0.6151, Adjusted R-squared: 0.5882
## F-statistic: 22.9 on 3 and 43 DF, p-value: 5.151e-09
The p-value of the interaction term is 0.342 which is not significant at the threshold level of 0.05. Thus we cannot reject the null hypothesis that the interaction term is 0. Thus this tells us that there very little evidence that the proportion of pollen depends on duration of visits differently for queens than for workers
lmfit <- lm(PollenRemovedlogit ~ BeeType + DurationOfVisitlog ,
data=data)
summary(lmfit)
##
## Call:
## lm(formula = PollenRemovedlogit ~ BeeType + DurationOfVisitlog,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.40852 -0.49627 0.08815 0.43598 1.15562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.7146 0.3842 -7.065 9.18e-09 ***
## BeeTypeWorker 0.5697 0.2364 2.409 0.0202 *
## DurationOfVisitlog 0.8886 0.1402 6.339 1.07e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.652 on 44 degrees of freedom
## Multiple R-squared: 0.6068, Adjusted R-squared: 0.5889
## F-statistic: 33.95 on 2 and 44 DF, p-value: 1.206e-09
The p-value of the BeeTypeWorker coefficient is 0.02 which is significant for a 0.05 level threshold and the estimate of the slope is 0.5697 so this is the amount by which the mean number of pollen proportions exceeds that with bee type queen. And hence YES, there is enough evidence to say that while adjusting for the time spent on flowers, workers remove a larger proportion than queens(alternatively queen removes a smaller proportion than workers)
Rearranging the model equation:
With the interaction term included, the effect of the indicator variable is now \(\beta_2+\beta_3 DurationOFVisitlog\) and hence the difference between the old(0.122) and new p-value(0.02) can be attributed to this difference in the model.
\[ SS(\beta_0,\beta_1\dots \beta_n) = \sum_{i=1}^N(Y_i-\beta_0-\beta_1X_{1i}-\beta_2X_{2i}-\cdots-\beta_pX_{pi})^2 \]
Similarly,
To prove that this is indeed the minimum, we need to show that \[\frac{\partial^2 SS}{\partial \beta_i^2}\] is convex:
Similarly for any \(1 \leq j \leq p\):And for \[ k \neq j \]:
each element in the hessian matrix in this case can be written as \(H_{ij}=\mathbb{x_i}^T\mathbb{x_j}\) and hence this is a Gram matrix and positive definite, hence minima at the above point is guaranteed.
Inverse of \(X^TX\) always exists, unless \(p+1 > n\) that is to say the number of regressors are more than the number of samples. This can still be solved by adding small values to \(X^TX\) that makes it invertible.