Problem 1

Define the likelihood function: \(L(\theta|N,n,k) = \frac{\binom{\theta}{k}\binom{N-\theta}{n-k}}{\binom{N}{n}}\)

We find the MLE from first principles.

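For \(\theta\) to be the MLE, the likelihood must be larger at \(\theta\) than at either neighbour. First, we need \(\frac{L(\theta|N,n,k)}{L(\theta-1|N,n,k)} >1\), which simplifies to \(\theta < \frac{k(N+1)}{n}\)

Similarly, we need \(\frac{L(\theta|N,n,k)}{L(\theta+1|N,n,k)} > 1\), which simplifies to \(\theta > \frac{k(N+1)}{n} - 1\)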

Thus, \(\frac{k(N+1)}{n} -1 < \theta < \frac{k(N+1)}{n}\)

and hence a valid choice for MLE is \(\hat{\theta} = \lfloor{\frac{ k(N+1) }{n} }\rfloor\)
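As a sanity check on the closed form, the likelihood can also be maximized by brute force. The sketch below assumes the hypergeometric likelihood (k successes in a sample of n drawn from a population of N that contains \(\theta\) successes) and uses the illustrative values N = 10, n = 4, k = 3. The names `theta_grid`, `lik`, `theta_brute` and `theta_closed` are introduced here for illustration only.

```r
# Illustrative values (assumed for this check)
N <- 10; n <- 4; k <- 3

# Admissible theta: at least k successes and at least n - k failures in the population
theta_grid <- k:(N - (n - k))

# Hypergeometric likelihood of observing k successes in a sample of n
lik <- dhyper(k, theta_grid, N - theta_grid, n)

theta_brute  <- theta_grid[which.max(lik)]   # maximizer found by enumeration
theta_closed <- floor(k * (N + 1) / n)       # closed-form candidate

print(c(brute = theta_brute, closed = theta_closed))
```

When \(\frac{k(N+1)}{n}\) happens to be an integer, both that value and the one below it maximize the likelihood, so the enumeration and the closed form may report different but equally valid maximizers.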

Part (b)

N=10, n=4, k=3

N <- 10
n <- 4
k <- 3
# MLE from part (a): the floor (not the ceiling) of k(N+1)/n = 8.25
theta <- floor(k*(N+1)/n)

\(\hat{\theta} =\) 8

To find the one-sided p-value for \(H_0 : \theta=4\), we compute the probability, under \(\theta=4\), of observing at least \(k=3\): \(\sum_{x=3}^{4} Pr(X=x \mid \theta=4)\)

# P(X = x | theta = 4) = choose(4,x) * choose(10-4, 4-x) / choose(10,4) for x = 3, 4
x <- seq(3,4,1);
s <- choose(4,x) * choose(10-4,4-x)/(choose(10,4))
sum(s)
## [1] 0.1190476

Thus, the one-sided p-value is 0.1190476

Problem 2

\(Y \sim Binomial(n, \pi)\) and \(\hat{\pi} = \frac{Y}{n}\). Define \(g(\pi) = \log{\frac{\pi}{1-\pi}}\). Then, by the delta method, \(E[g(\hat{\pi})] \approx g(E[\hat{\pi}]) = g(\pi)\)

\(Var(g(\hat{\pi})) \approx g'(\pi)^2 Var(\hat{\pi})\), and by definition \(Var(g(\hat{\pi})) = E[g(\hat{\pi})^2]-(E[g(\hat{\pi})])^2\)

Thus the mean square of \(g(\hat{\pi})\) is given by \(E[g(\hat{\pi})^2] \approx g'(\pi)^2 Var(\hat{\pi})+(E[g(\hat{\pi})])^2\)

where \(Var(\hat{\pi})=\frac{\pi(1-\pi)}{n}\), estimated by \(\frac{\hat{\pi}(1-\hat{\pi})}{n}=\frac{1}{n}\frac{Y}{n}(1-\frac{Y}{n})\), and \(g'(\hat{\pi}) = \frac{1}{\hat{\pi}} + \frac{1}{1-\hat{\pi}} = \frac{1}{\hat{\pi}(1-\hat{\pi})}\)

Thus, \(Var(g(\hat{\pi})) \approx \frac{1}{\hat{\pi}^2(1-\hat{\pi})^2} \times \frac{\hat{\pi}(1-\hat{\pi})}{n} = \frac{1}{n\hat{\pi}(1-\hat{\pi})}\)

Thus, \(E[g(\hat{\pi})^2] \approx \frac{1}{n\hat{\pi}(1-\hat{\pi})}+ \log^2{\frac{\hat{\pi}}{1-\hat{\pi}}}\)
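The delta-method variance of the log-odds can be sanity-checked by simulation. The sketch below takes \(Var(\hat{\pi}) = \pi(1-\pi)/n\), so that the approximation is \(1/(n\pi(1-\pi))\). The values `n_trials = 200` and `p_true = 0.3` are arbitrary assumptions chosen so that \(\hat{\pi}\) stays away from 0 and 1.

```r
set.seed(1)                      # for reproducibility
n_trials <- 200                  # binomial size n (assumed)
p_true   <- 0.3                  # true pi (assumed)

y      <- rbinom(1e5, size = n_trials, prob = p_true)
pi_hat <- y / n_trials

# g(pi_hat); finite as long as 0 < y < n, which is practically certain here
logit <- log(pi_hat / (1 - pi_hat))

var_sim   <- var(logit)                               # empirical variance
var_delta <- 1 / (n_trials * p_true * (1 - p_true))   # delta-method approximation

print(c(simulated = var_sim, delta = var_delta))
```

The two numbers should agree closely for large n; the approximation is expected to degrade as n shrinks or as \(\pi\) approaches 0 or 1.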

Problem 3

A possible model to describe the given two-way factorial experiment would be two-way ANOVA.
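Written out, one common form of such a model is the two-way ANOVA with interaction shown below. The index ranges, the inclusion of an interaction term, and the normal-error assumption are assumptions made here for illustration, since the details of the experiment are not restated in this solution.

```latex
Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad \varepsilon_{ijk} \overset{iid}{\sim} N(0, \sigma^2),
\qquad i = 1,\dots,a,\; j = 1,\dots,b,\; k = 1,\dots,r
```

Here \(\mu\) is the overall mean, \(\alpha_i\) and \(\beta_j\) are the main effects of the two factors, \((\alpha\beta)_{ij}\) is their interaction, and \(k\) indexes replicates within each cell.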

Problem 4

Part (a)

In this part, the variables \(X_1\) and \(X_2\) are correlated. The main difficulty that could arise is in interpreting the coefficients associated with these variables. By definition, a coefficient gives the amount by which the mean response changes for a unit change in its covariate when all other covariates are held fixed. However, since \(X_1\) and \(X_2\) are highly correlated in this case, changing one would also imply changing the other.

To overcome this, we can keep only one of \(X_1\) or \(X_2\) as a covariate in the linear regression model (essentially discarding the other), choosing whichever predictor better captures the ‘reality’ of the independent variable.
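The effect of correlated predictors on the fitted coefficients can be illustrated with a small simulation. This is a sketch on made-up data, not the data from the problem: `x2` is constructed as a noisy copy of `x1`, and all names and constants are assumptions.

```r
set.seed(42)
m  <- 200
x1 <- rnorm(m)
x2 <- x1 + rnorm(m, sd = 0.1)        # x2 is almost a copy of x1
y  <- 1 + x1 + x2 + rnorm(m)         # both predictors enter the true model

fit_both <- lm(y ~ x1 + x2)          # both correlated predictors
fit_one  <- lm(y ~ x1)               # only x1 retained

# Standard error of the x1 coefficient in each fit
se_both <- summary(fit_both)$coefficients["x1", "Std. Error"]
se_one  <- summary(fit_one)$coefficients["x1", "Std. Error"]

print(c(cor = cor(x1, x2), se_both = se_both, se_one = se_one))
```

With both predictors in the model the standard error of the `x1` coefficient is much larger. In the reduced fit the `x1` coefficient absorbs the effect of `x2`, so it estimates roughly the combined effect rather than the effect of `x1` alone, which is the interpretive trade-off of dropping a predictor.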

Part (b)

There is a point at the top right that is an outlier when \((X_1,X_2)\) are considered together, but not when each is considered individually. Since we are minimizing the squared error, the presence of such a point will affect the coefficients (and they will turn out to be smaller in magnitude). One way to deal with such outliers is either to use regularization (L1, L2) or to drop them entirely.

Problem 5

Given \(\log{\frac{p}{1-p}}=3.2 - 0.078 \times age\) for men and \(\log{\frac{p}{1-p}}=1.6 - 0.078 \times age\) for women where \(p\) denotes the probability of survival.

Consider
pman <- 1/(1+exp(-(3.2-0.078*25)))
pwoman <-  1/(1+exp(-(1.6-0.078*50)))

Thus, the estimated probability of survival for a man aged 25 is 0.7772999 and for a woman aged 50 is 0.091123

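Age at which probability of survival is 0.5: setting \(p=0.5\) gives \(\log{\frac{p}{1-p}}=0\), so \(age = 3.2/0.078\) for men and \(age = 1.6/0.078\) for women.

Thus, the age at which a man’s probability of surviving is 0.5 is 41.025641, and for a woman it is 20.5128205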
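The four numbers above can be reproduced in a few lines. This is a minimal sketch using the coefficients given in the problem; the variable names are introduced here for illustration, and `plogis` is base R's inverse-logit function.

```r
# Fitted intercepts and common age slope, as given in the problem
b0_man <- 3.2; b0_woman <- 1.6; b_age <- -0.078

# Survival probability: inverse logit of the linear predictor
p_man_25   <- plogis(b0_man   + b_age * 25)
p_woman_50 <- plogis(b0_woman + b_age * 50)

# p = 0.5 means log-odds = 0, so age = -intercept / slope
age_man_50pct   <- -b0_man   / b_age
age_woman_50pct <- -b0_woman / b_age

print(c(p_man_25, p_woman_50, age_man_50pct, age_woman_50pct))
```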

Problem 6

library(car)
problem6 <- read.csv('problem6.csv')
model <- lm(y~x1+x2, problem6)
vif(model)
##       x1       x2 
## 1.666667 1.666667

We first consider \(VIF1\), regressing \(X_1\) on \(X_2\): \(X_1 = \beta_0 + \beta_1X_2\). Here \(\bar{X_1} = 2.5\) and \(\bar{X_2} = 0\). Also, \(\sum_i(X_{2i}-\bar{X_2})^2 = (-1-0)^2+(0-0)^2+(1-0)^2+(0-0)^2=2\)

\(\beta_1 = \frac{\sum_i(X_{1i}-\bar{X_1})(X_{2i}-\bar{X_2})}{\sum_i(X_{2i}-\bar{X_2})^2} = \frac{(1-2.5)(-1-0)+(2-2.5)(0) + (3-2.5)(1-0) + (4-2.5)(0)}{2} = 1\)

\(\beta_0 = \bar{X_1}-\beta_1 \bar{X_2} = 2.5-0 = 2.5\)

\(X_1 = 2.5 + X_2\)

\(SS_{Tot} = \sum_i(X_{1i}-\bar{X_1})^2 = (1-2.5)^2 + (2-2.5)^2 + (3-2.5)^2 + (4-2.5)^2 = 5\)

\(SS_{Res}=(1-1.5)^2+(2-2.5)^2+(3-3.5)^2+(4-2.5)^2=3\)

Thus \(R_1^2 = 1-SS_{Res}/SS_{Tot} = 1-3/5 = 0.4\)

And \(VIF1 = 1/(1-R_1^2) = 1.6667\)

Similarly for \(VIF2\) we consider \(X_2 = \beta_0 + \beta_1 X_1\)

\(\sum_i (X_{1i}-\bar{X_1})^2 = 5\) and hence \(\beta_1 = \frac{2}{5}\). Thus, \(\beta_0 = \bar{X_2}-\beta_1\bar{X_1} = 0-2.5 \times 2/5 = -1\)

Thus, \(X_2 = -1 + 0.4 X_1\)

\(SS_{Tot} = \sum_i (X_{2i}-\bar{X_2})^2 = 2\)

And \(SS_{Res} = (-1+0.6)^2+(0+0.2)^2+(1-0.2)^2+(0-0.6)^2 = 1.2\)

and hence \(R_2^2 = 1-1.2/2=0.4\) and \(VIF2 = 1/(1-R_2^2) = 1.667\)
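The hand calculation can be cross-checked without the `car` package. The sketch below assumes the design points are those used in the sums above, namely \(X_1 = (1,2,3,4)\) and \(X_2 = (-1,0,1,0)\), and that they match the contents of `problem6.csv`. The names `r2_1`, `r2_2`, `vif_1` and `vif_2` are introduced for illustration.

```r
# Design points read off the hand calculation (assumed to match problem6.csv)
x1 <- c(1, 2, 3, 4)
x2 <- c(-1, 0, 1, 0)

# R^2 from regressing each predictor on the other
r2_1 <- summary(lm(x1 ~ x2))$r.squared
r2_2 <- summary(lm(x2 ~ x1))$r.squared

# VIF_j = 1 / (1 - R_j^2)
vif_1 <- 1 / (1 - r2_1)
vif_2 <- 1 / (1 - r2_2)

print(c(VIF1 = vif_1, VIF2 = vif_2))
```

With only two predictors, both \(R^2\) values equal the squared sample correlation between \(X_1\) and \(X_2\), which is why the two VIFs coincide.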

Part (b)

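fit_summary <- summary(model)   # avoid reusing the name of base::c
s <- fit_summary$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)
SE1 <- s[2,2]   # standard error of the x1 coefficient
SE2 <- s[3,2]   # standard error of the x2 coefficient (column 2; column 3 is the t value)
b1  <- s[2,1]   # estimate for x1
b2  <- s[3,1]   # estimate for x2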

Problem 7

Problem 8

Given model: \(T(t,z)=A(t-t_0)^pz^q\)

Part (a)

To estimate \(A\), \(p\) and \(q\), we consider the log-transformed model: \(\log(T) = \log(A)+p\log(t-t_0) + q\log(z)\). This is essentially a linear regression model of the form \(Y=\beta_0 + \beta_1X_1 + \beta_2X_2\), where \(Y=\log(T)\), \(\beta_0=\log(A)\), \(\beta_1=p\), \(X_1=\log(t-t_0)\), \(\beta_2=q\) and \(X_2=\log(z)\),

and the coefficients \((\beta_0, \beta_1, \beta_2)^T= (X^TX)^{-1}X^TY\)
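The estimation step can be sketched on toy data. Everything numeric below is made up for illustration: the true values of `A`, `p`, `q`, the grid of times and depths, and the noise level. The sketch also assumes \(t_0\) is known, since otherwise the log-transformed model is not linear in the parameters. The response is named `Temp` to avoid masking R's `T`.

```r
set.seed(7)
# Hypothetical ground truth, used only to generate toy data
A <- 2; p <- 0.5; q <- -1.5; t0 <- 0.5

grid <- expand.grid(t = 1:10, z = 1:5)     # times and depths (made up)
Temp <- A * (grid$t - t0)^p * grid$z^q * exp(rnorm(nrow(grid), sd = 0.01))

# Design matrix with intercept, log(t - t0) and log(z); response is log(T)
X <- cbind(1, log(grid$t - t0), log(grid$z))
Y <- log(Temp)

# Least-squares solution of the normal equations (X'X) beta = X'Y
beta <- drop(solve(crossprod(X), crossprod(X, Y)))

# Back-transform the intercept to recover A
est <- c(A = exp(beta[1]), p = beta[2], q = beta[3])
print(est)
```

In practice `lm(log(Temp) ~ log(t - t0) + log(z))` gives the same coefficients and is numerically more stable than forming \(X^TX\) explicitly; the explicit solve is shown only to mirror the formula above.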

Part (b)

To test the hypothesis that temperature does not change in time for each depth,