GLMs

Keywords

biostatistics, healthcare, statistics, R, statistical testing, power analysis, GLM, regression

Dataset

We’ll use the DoctorVisits dataset from the AER package, based on the Australian Health Survey. This dataset includes:

  • visits: Number of doctor consultations (outcome variable)
  • gender: Male or Female
  • age: Age in years (divided by 100 in raw data)
  • income: Annual income in tens of thousands
  • illness: Number of illnesses in past 2 weeks
  • health: General health questionnaire (Goldberg) score (0-12, higher values = worse health)
  • nchronic: Chronic condition not limiting activity (no/yes)
  • private: Private health insurance indicator (no/yes)

This is real survey data from 5,190 individuals in Australia.

Setup

library(tidyverse)   # data wrangling and ggplot2
library(MASS)        # glm.nb() for Negative Binomial regression
library(pscl)        # zero-inflated / hurdle count models
library(AER)         # DoctorVisits data and dispersiontest()
library(ggpubr)
library(patchwork)   # combining ggplots
library(broom)       # tidy() model summaries
library(knitr)       # kable() tables
library(gridExtra)   # tableGrob()

theme_set(theme_minimal(base_size = 12))

Note that MASS masks dplyr::select(), which is why dplyr::select() is called explicitly in the data preparation below.

Data Loading and Preparation

data("DoctorVisits", package = "AER")

health_data <- DoctorVisits %>%
  mutate(
    age_years = age * 100,
    income_k = income * 10,
    private_insurance = factor(private, levels = c("no", "yes"), labels = c("No", "Yes")),
    chronic_condition = factor(nchronic, levels = c("no", "yes"), labels = c("No", "Yes"))
  ) %>%
  dplyr::select(
    visits,              # Outcome: number of doctor visits
    gender,              # Gender
    age_years,           # Age in years
    income_k,            # Income in thousands
    illness,             # Recent illnesses
    health,              # General health questionnaire score (0-12, higher = worse)
    chronic_condition,   # Has chronic condition
    private_insurance    # Private insurance
  )

head(health_data) %>% kable()
 visits  gender  age_years  income_k  illness  health  chronic_condition  private_insurance
      1  female         19       5.5        1       1                 No                Yes
      1  female         19       4.5        1       1                 No                Yes
      1  male           19       9.0        3       0                 No                 No
      1  male           19       1.5        1       0                 No                 No
      1  male           19       4.5        2       1                Yes                 No
      1  female         19       3.5        5       9                Yes                 No
cat("Sample size:", nrow(health_data), "individuals\n\n")
Sample size: 5190 individuals
summary(health_data)
     visits          gender       age_years        income_k     
 Min.   :0.0000   male  :2488   Min.   :19.00   Min.   : 0.000  
 1st Qu.:0.0000   female:2702   1st Qu.:22.00   1st Qu.: 2.500  
 Median :0.0000                 Median :32.00   Median : 5.500  
 Mean   :0.3017                 Mean   :40.64   Mean   : 5.832  
 3rd Qu.:0.0000                 3rd Qu.:62.00   3rd Qu.: 9.000  
 Max.   :9.0000                 Max.   :72.00   Max.   :15.000  
    illness          health       chronic_condition private_insurance
 Min.   :0.000   Min.   : 0.000   No :3098          No :2892         
 1st Qu.:0.000   1st Qu.: 0.000   Yes:2092          Yes:2298         
 Median :1.000   Median : 0.000                                      
 Mean   :1.432   Mean   : 1.218                                      
 3rd Qu.:2.000   3rd Qu.: 2.000                                      
 Max.   :5.000   Max.   :12.000                                      

EDA

p1 <- ggplot(health_data, aes(x = visits)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.8) +
  labs(title = "Distribution of doctor office visits",
       x = "Number of visits",
       y = "Frequency") +
  theme_minimal(base_size = 12)

p2 <- ggplot(health_data, aes(x = visits)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.8) +
  scale_y_log10() +
  labs(title = "Distribution (Log Scale)",
       x = "Number of visits",
       y = "Frequency (log scale)") +
  theme_minimal(base_size = 12)

p1 + p2
Figure 1: Distribution of doctor visits is highly right-skewed
desc_stats <- health_data %>%
  summarise(
    Mean = mean(visits),
    Variance = var(visits),
    `Variance/Mean` = Variance / Mean,
    Median = median(visits),
    SD = sd(visits),
    Min = min(visits),
    Max = max(visits),
    `% Zeros` = mean(visits == 0) * 100
  )

desc_stats %>%
  mutate(across(where(is.numeric), ~round(., 2))) %>%
  kable()
Table 1: Descriptive statistics for office visits
 Mean  Variance  Variance/Mean  Median    SD  Min  Max  % Zeros
  0.3      0.64           2.11       0   0.8    0    9    79.79

The variance (0.64) is much larger than the mean (0.30). The variance-to-mean ratio is 2.11, indicating overdispersion.
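One simple symptom of this: a Poisson distribution with the same mean would produce fewer zeros than we actually observe.

# Share of zeros implied by a Poisson with the observed mean vs the observed share
dpois(0, lambda = mean(health_data$visits))   # about 0.74 expected under a Poisson model
mean(health_data$visits == 0)                 # about 0.80 observed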

A natural choice for modeling count data is the Poisson distribution, but the overdispersion suggests we may need a more flexible model such as the Negative Binomial.

Office visits by health score

p1 <- ggplot(health_data, aes(x = factor(health), y = visits)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7, outlier.alpha = 0.3) +
  labs(title = "Office visits by health score",
       x = "Health score",
       y = "Number of visits") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

p2 <- health_data %>%
  group_by(health) %>%
  summarise(
    Mean = mean(visits),
    Median = median(visits),
    SD = sd(visits)
  ) %>%
  pivot_longer(cols = c(Mean, Median), names_to = "Statistic", values_to = "Value") %>%
  ggplot(aes(x = health, y = Value, fill = Statistic)) +
  geom_col(position = "dodge", alpha = 0.8) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Mean and median office visits",
       x = "Health score",
       y = "Number of visits") +
  theme_minimal(base_size = 12)

p1 + p2
Figure 2: Office visits vary substantially by health status

Chronic conditions effect

health_data %>%
  ggplot(aes(x = chronic_condition, y = visits)) +
  geom_boxplot(fill = "coral", alpha = 0.7, outlier.alpha = 0.3) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3,
               fill = "red", color = "darkred") +
  labs(title = "Office visits by chronic condition status",
       subtitle = "Red diamond = mean, box shows median and IQR",
       x = "Has chronic condition",
       y = "Number of office visits") +
  theme_minimal(base_size = 12)
Figure 3: Having a chronic condition is associated with more office visits

Why not OLS?

Before diving into Poisson and Negative Binomial models, let’s see why ordinary least squares (OLS) regression fails for count data.

ols_model <- lm(visits ~ health + chronic_condition + gender + age_years +
                private_insurance + illness,
                data = health_data)

comparison_data <- health_data %>%
  slice_sample(n = 1000) %>%  # Sample for clearer visualization
  mutate(
    ols_pred = predict(ols_model, newdata = .)
  )

p1 <- ggplot(comparison_data, aes(x = health, y = visits)) +
  geom_jitter(alpha = 0.3, width = 0.2, height = 0.2, color = "gray40") +
  geom_smooth(method = "lm", se = TRUE, color = "red", linewidth = 1.5) +
  labs(title = "OLS fit",
       subtitle = "Linear fit can extrapolate to impossible negative values",
       x = "Health Score", y = "Office Visits") +
  theme_minimal(base_size = 11)

p2 <- ggplot(comparison_data, aes(x = ols_pred)) +
  geom_histogram(bins = 30, fill = "red", alpha = 0.6) +
  geom_vline(xintercept = 0, linetype = "dashed", linewidth = 1) +
  labs(title = "OLS preds",
       subtitle = paste0("Min = ", round(min(comparison_data$ols_pred), 2),
                        " (impossible!)"),
       x = "Predicted visits", y = "count") +
  theme_minimal(base_size = 11)

ols_diag <- data.frame(
  fitted = fitted(ols_model),
  residuals = residuals(ols_model)
)

p4 <- ggplot(ols_diag, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.3, color = "red") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(se = FALSE, color = "darkred") +
  labs(title = "OLS residuals vs fitted",
       subtitle = "Clear heteroscedasticity pattern",
       x = "Fitted values", y = "residuals") +
  theme_minimal(base_size = 11)

p5 <- ggplot(ols_diag, aes(sample = residuals)) +
  stat_qq(color = "red", alpha = 0.5) +
  stat_qq_line(color = "darkred", linewidth = 1) +
  labs(title = "OLS Q-Q Plot",
       subtitle = "Heavy right tail = non-normality",
       x = "Theoretical quantiles", y = "Sample quantiles") +
  theme_minimal(base_size = 11)

ols_summary <- data.frame(
  Statistic = c("Min prediction", "Max prediction", "Negative predictions", "Mean abs error"),
  Value = c(min(comparison_data$ols_pred),
            max(comparison_data$ols_pred),
            sum(comparison_data$ols_pred < 0),
            mean(abs(comparison_data$visits - comparison_data$ols_pred)))
)

p3 <- gridExtra::tableGrob(ols_summary %>%
                            mutate(Value = round(Value, 2)),
                            rows = NULL)

(p1 | p2 | p3) / (p4 | p5 | plot_spacer())
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Figure 4: OLS produces impossible predictions and violates assumptions for count data
Warning: Why OLS is not the right model for count data

  1. OLS can predict negative counts (e.g., -0.5 doctor visits)

  2. Variance increases with the mean, violating the constant-variance assumption

    • People with many visits have more variable outcomes than those with few visits

  3. Count data are right-skewed, violating normality of residuals: most people have 0-2 visits, while a few have 10+

  4. Counts follow discrete distributions (Poisson/Negative Binomial), not a continuous normal distribution
Note: What about the coefficients?

You might wonder: “If I just use OLS for interpretation and ignore the predictions, is that okay?”

No! The coefficients and standard errors from OLS are also biased and inconsistent for count data:

  • Biased standard errors → Wrong confidence intervals and p-values
  • Inefficient estimates → Larger standard errors than necessary
  • Wrong interpretation → OLS coefficients are additive, but count effects are multiplicative

We’ll see the correct approach (Poisson/NB) produces different coefficients with the proper multiplicative interpretation.


Now let’s fit the proper models for count data.

Poisson Regression

A count variable \(Y\) follows a Poisson distribution with parameter \(\lambda > 0\):

\[ P(Y = y) = \frac{e^{-\lambda} \lambda^y}{y!}, \quad y = 0, 1, 2, \ldots \]

Key property: For Poisson, \(E[Y] = \text{Var}(Y) = \lambda\).
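As a quick illustration of this equidispersion property, here is a minimal simulation sketch (the value \(\lambda = 2\) is an arbitrary choice):

# Simulate Poisson counts and check that the sample mean and variance agree
set.seed(123)
y_sim <- rpois(1e5, lambda = 2)
mean(y_sim)   # close to 2
var(y_sim)    # also close to 2, since E[Y] = Var(Y) = lambda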

Poisson GLM

Components:

  1. Random component: \(Y_i \sim \text{Poisson}(\lambda_i)\)

  2. Systematic component: Linear predictor \[ \eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \]

  3. Link function: Log link (canonical for Poisson) \[ \log(\lambda_i) = \eta_i \]

Mean structure: \[ \lambda_i = E[Y_i] = \exp(\eta_i) = \exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \]
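To make the log link concrete, here is a small sketch with assumed coefficient values (illustrative only, not taken from the model fitted below):

# Log link: the linear predictor can be any real number, but exp() keeps the mean positive
beta0 <- -2.4
beta_illness <- 0.25                       # assumed values for illustration
illness <- 0:5
eta <- beta0 + beta_illness * illness      # linear predictor (log scale)
lambda <- exp(eta)                         # expected counts, always > 0
round(lambda, 3)                           # each extra illness multiplies the mean by exp(0.25), about 1.28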

Interpretation of coefficients

For predictor \(x_j\):

  • \(\beta_j\): Change in log-count per unit increase in \(x_j\)
  • \(\exp(\beta_j)\): Multiplicative effect on the count
    • If \(\exp(\beta_j) = 1.5\): A one-unit increase in \(x_j\) multiplies the expected count by 1.5 (50% increase)
    • If \(\exp(\beta_j) = 0.8\): A one-unit increase in \(x_j\) multiplies the expected count by 0.8 (20% decrease)

Fitting Poisson model

# Fit Poisson GLM
poisson_model <- glm(visits ~ health + chronic_condition + gender + age_years + private_insurance + illness,
                     data = health_data,
                     family = poisson(link = "log"))

# Model summary
summary(poisson_model)

Call:
glm(formula = visits ~ health + chronic_condition + gender + 
    age_years + private_insurance + illness, family = poisson(link = "log"), 
    data = health_data)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -2.381559   0.074344 -32.034  < 2e-16 ***
health                0.102714   0.009168  11.204  < 2e-16 ***
chronic_conditionYes -0.095120   0.053855  -1.766  0.07736 .  
genderfemale          0.179821   0.054694   3.288  0.00101 ** 
age_years             0.010480   0.001325   7.908 2.62e-15 ***
private_insuranceYes  0.071464   0.051591   1.385  0.16599    
illness               0.251033   0.017622  14.245  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 5634.8  on 5189  degrees of freedom
Residual deviance: 4964.6  on 5183  degrees of freedom
AIC: 7310.1

Number of Fisher Scoring iterations: 6

Coefficient Interpretation

# Extract coefficients with confidence intervals
poisson_coefs <- tidy(poisson_model, conf.int = TRUE) %>%
  mutate(
    exp_estimate = exp(estimate),
    exp_conf_low = exp(conf.low),
    exp_conf_high = exp(conf.high),
    `Percent Change` = (exp_estimate - 1) * 100
  ) %>%
  dplyr::select(term, estimate, exp_estimate, `Percent Change`,
         exp_conf_low, exp_conf_high, p.value)

poisson_coefs %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  kable(col.names = c("Predictor", "β (log scale)", "exp(β)",
                      "% Change", "95% CI Lower", "95% CI Upper", "p-value"))
Table 2: Poisson model coefficients with interpretation
Predictor             β (log scale)  exp(β)  % Change  95% CI Lower  95% CI Upper  p-value
(Intercept)                 -2.3816  0.0924  -90.7594        0.0798        0.1068   0.0000
health                       0.1027  1.1082   10.8174        1.0882        1.1280   0.0000
chronic_conditionYes        -0.0951  0.9093   -9.0736        0.8181        1.0104   0.0774
genderfemale                 0.1798  1.1970   19.7003        1.0757        1.3330   0.0010
age_years                    0.0105  1.0105    1.0535        1.0079        1.0132   0.0000
private_insuranceYes         0.0715  1.0741    7.4079        0.9705        1.1881   0.1660
illness                      0.2510  1.2854   28.5352        1.2416        1.3304   0.0000
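To connect this with the earlier note on additive versus multiplicative effects, here is a quick side-by-side with the OLS fit from the previous section (a rough illustration, not a formal comparison):

# Additive (OLS) vs multiplicative (Poisson) effect of one additional recent illness
coef(ols_model)["illness"]            # additive: change in expected visits per extra illness
exp(coef(poisson_model)["illness"])   # multiplicative: expected visits scaled by about 1.29 per extra illness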

Model diagnostics

Testing for overdispersion

The Poisson assumption is \(\text{Var}(Y) = E[Y]\). If variance exceeds the mean (overdispersion), Poisson underestimates standard errors.

disp_test <- dispersiontest(poisson_model)
print(disp_test)

    Overdispersion test

data:  poisson_model
z = 6.8701, p-value = 3.209e-12
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion 
   1.85058 
pearson_resid <- residuals(poisson_model, type = "pearson")
dispersion_param <- sum(pearson_resid^2) / poisson_model$df.residual

cat("\n=== Overdispersion Check ===\n")

=== Overdispersion Check ===
cat("Residual deviance:", poisson_model$deviance, "\n")
Residual deviance: 4964.556 
cat("Degrees of freedom:", poisson_model$df.residual, "\n")
Degrees of freedom: 5183 
cat("Deviance / df:", poisson_model$deviance / poisson_model$df.residual, "\n")
Deviance / df: 0.9578538 
cat("Dispersion parameter (Pearson):", dispersion_param, "\n\n")
Dispersion parameter (Pearson): 1.815533 
if (dispersion_param > 1.5) {
  cat("Variance is", round(dispersion_param, 2), "times larger than mean.\n")
  cat("Poisson model may be inappropriate. Consider Negative Binomial.\n")
}
Variance is 1.82 times larger than mean.
Poisson model may be inappropriate. Consider Negative Binomial.
  • Dispersion parameter = 1.82 (should be ≈ 1 for Poisson)
  • This indicates overdispersion → use a Negative Binomial model

Residual plots

# Create diagnostic plots
par(mfrow = c(2, 2))

# 1. Residuals vs Fitted
plot(fitted(poisson_model), residuals(poisson_model, type = "pearson"),
     xlab = "Fitted balues", ylab = "Pearson residuals",
     main = "Residuals vs fitted", pch = 20, col = rgb(0, 0, 1, 0.3))
abline(h = 0, col = "red", lty = 2, lwd = 2)
abline(h = c(-2, 2), col = "orange", lty = 2)

# 2. Q-Q plot
qqnorm(residuals(poisson_model, type = "pearson"),
       main = "Normal Q-Q Plot",
       pch = 20, col = rgb(0, 0, 1, 0.3))
qqline(residuals(poisson_model, type = "pearson"), col = "red", lwd = 2)

# 3. Scale-Location
plot(fitted(poisson_model), sqrt(abs(residuals(poisson_model, type = "pearson"))),
     xlab = "Fitted values", ylab = "√|Pearson residuals|",
     main = "Scale-Location", pch = 20, col = rgb(0, 0, 1, 0.3))
lines(lowess(fitted(poisson_model), sqrt(abs(residuals(poisson_model, type = "pearson")))),
      col = "red", lwd = 2)

# 4. Residuals vs Leverage
plot(hatvalues(poisson_model), residuals(poisson_model, type = "pearson"),
     xlab = "Leverage", ylab = "Pearson Residuals",
     main = "Residuals vs Leverage", pch = 20, col = rgb(0, 0, 1, 0.3))
abline(h = 0, col = "red", lty = 2, lwd = 2)

par(mfrow = c(1, 1))
Figure 5: Diagnostic plots for Poisson model

Negative Binomial Regression

The Negative Binomial (NB) distribution is a generalization of Poisson that allows overdispersion.

Mean-dispersion parameterization (the most common in GLMs): \[ E[Y] = \mu, \quad \text{Var}(Y) = \mu + \frac{\mu^2}{\theta} \] where \(\theta > 0\) is the dispersion parameter.

Key property: as \(\theta \to \infty\), the Negative Binomial converges to the Poisson.

When variance > mean:

  • If \(\theta\) is small: high overdispersion
  • If \(\theta\) is large: low overdispersion (closer to Poisson)
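A minimal simulation sketch of this variance inflation, using assumed illustrative values \(\mu = 1\) and \(\theta = 0.6\):

# NB counts have variance mu + mu^2/theta, so the sample variance exceeds the mean
set.seed(42)
y_nb <- rnbinom(1e5, size = 0.6, mu = 1)   # 'size' is theta in this parameterization
mean(y_nb)   # close to 1
var(y_nb)    # close to 1 + 1^2 / 0.6 = 2.67, well above the mean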

NB GLM structure

Same as Poisson, but with different variance structure:

  1. Random component: \(Y_i \sim \text{NB}(\mu_i, \theta)\)

  2. Systematic component: \(\eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\)

  3. Link function: \(\log(\mu_i) = \eta_i\)

Variance: \(\text{Var}(Y_i) = \mu_i + \frac{\mu_i^2}{\theta}\)

The extra term \(\frac{\mu_i^2}{\theta}\) allows variance to exceed the mean.

Fitting the NB model

nb_model <- glm.nb(visits ~ health + chronic_condition + gender + age_years + private_insurance + illness,
                   data = health_data)

summary(nb_model)

Call:
glm.nb(formula = visits ~ health + chronic_condition + gender + 
    age_years + private_insurance + illness, data = health_data, 
    init.theta = 0.5806522086, link = log)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -2.449416   0.092423 -26.502  < 2e-16 ***
health                0.109412   0.013455   8.131 4.24e-16 ***
chronic_conditionYes -0.075378   0.070288  -1.072 0.283533    
genderfemale          0.230690   0.069757   3.307 0.000943 ***
age_years             0.009659   0.001704   5.667 1.45e-08 ***
private_insuranceYes  0.064622   0.066611   0.970 0.331977    
illness               0.283096   0.023758  11.916  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.5807) family taken to be 1)

    Null deviance: 3444.5  on 5189  degrees of freedom
Residual deviance: 3008.9  on 5183  degrees of freedom
AIC: 6785.8

Number of Fisher Scoring iterations: 1

              Theta:  0.5807 
          Std. Err.:  0.0465 

 2 x log-likelihood:  -6769.8180 
theta_hat <- nb_model$theta
cat("\n=== Dispersion Parameter ===\n")

=== Dispersion Parameter ===
cat("Theta (θ):", theta_hat, "\n")
Theta (θ): 0.5806522 
cat("Standard Error:", nb_model$SE.theta, "\n")
Standard Error: 0.04648321 
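With the estimated \(\hat\theta \approx 0.58\), we can see how much extra variance the NB model allows relative to Poisson at a few example means (a quick illustrative check):

# Implied NB variance mu + mu^2/theta at example means, vs the Poisson variance (= mu)
mu <- c(0.2, 0.5, 1, 2)
cbind(mu, poisson_var = mu, nb_var = round(mu + mu^2 / theta_hat, 2))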

Coefficient interpretation

nb_coefs <- tidy(nb_model, conf.int = TRUE) %>%
  mutate(
    exp_estimate = exp(estimate),
    exp_conf_low = exp(conf.low),
    exp_conf_high = exp(conf.high),
    `Percent Change` = (exp_estimate - 1) * 100
  ) %>%
  dplyr::select(term, estimate, exp_estimate, `Percent Change`,
         exp_conf_low, exp_conf_high, p.value)

nb_coefs %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  kable(col.names = c("Predictor", "β (log scale)", "exp(β)",
                      "% Change", "95% CI Lower", "95% CI Upper", "p-value"))
Table 3: NB coefficients
Predictor             β (log scale)  exp(β)  % Change  95% CI Lower  95% CI Upper  p-value
(Intercept)                 -2.4494  0.0863  -91.3656        0.0718        0.1035   0.0000
health                       0.1094  1.1156   11.5621        1.0864        1.1459   0.0000
chronic_conditionYes        -0.0754  0.9274   -7.2607        0.8078        1.0641   0.2835
genderfemale                 0.2307  1.2595   25.9469        1.1006        1.4416   0.0009
age_years                    0.0097  1.0097    0.9706        1.0064        1.0131   0.0000
private_insuranceYes         0.0646  1.0668    6.6756        0.9363        1.2154   0.3320
illness                      0.2831  1.3272   32.7232        1.2657        1.3922   0.0000

Interpretation is the same as Poisson: \(\exp(\beta_j)\) gives the multiplicative effect on the expected count.
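For example, a quick check on the illness coefficient from the fitted NB model:

# Multiplicative effect of one additional recent illness on the expected number of visits
exp(coef(nb_model)["illness"])               # about 1.33
(exp(coef(nb_model)["illness"]) - 1) * 100   # about 33% more expected visits per extra illness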

Comparing Poisson vs NB

Coefficient comparison

# Combine coefficients from both models
coef_compare <- bind_rows(
  poisson_coefs %>% mutate(Model = "Poisson"),
  nb_coefs %>% mutate(Model = "Negative Binomial")
) %>%
  filter(term != "(Intercept)")

ggplot(coef_compare, aes(x = term, y = exp_estimate, color = Model, shape = Model)) +
  geom_point(size = 3, position = position_dodge(width = 0.5)) +
  geom_errorbar(aes(ymin = exp_conf_low, ymax = exp_conf_high),
                width = 0.2, position = position_dodge(width = 0.5)) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "gray50") +
  coord_flip() +
  scale_color_manual(values = c("Poisson" = "steelblue",
                                 "Negative Binomial" = "firebrick")) +
  labs(title = "Coefficient estimates: Poisson vs Negative Binomial",
       subtitle = "Points show exp(β) with 95% confidence intervals",
       x = "Predictor",
       y = "Multiplicative effect (exp(β))") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "top")
Figure 6: Coefficient comparison: Poisson vs Negative Binomial

Standard error comparison

se_compare <- tibble(
  Predictor = names(coef(poisson_model))[-1],
  `Poisson SE` = summary(poisson_model)$coefficients[-1, "Std. Error"],
  `NB SE` = summary(nb_model)$coefficients[-1, "Std. Error"],
  `Ratio (NB/Poisson)` = `NB SE` / `Poisson SE`
)

se_compare %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  kable()
Table 4: Standard errors: Poisson tends to underestimate when overdispersed
Predictor             Poisson SE   NB SE  Ratio (NB/Poisson)
health                    0.0092  0.0135              1.4677
chronic_conditionYes      0.0539  0.0703              1.3051
genderfemale              0.0547  0.0698              1.2754
age_years                 0.0013  0.0017              1.2860
private_insuranceYes      0.0516  0.0666              1.2911
illness                   0.0176  0.0238              1.3481

NB standard errors are larger than the Poisson ones (ratio > 1), reflecting the true uncertainty once overdispersion is accounted for.

Model stats

We did not cover this in class, but it is not difficult to understand. A better-fitting model has a lower Akaike information criterion (AIC) and a higher log-likelihood. We can compare the two models on these metrics, as well as with a likelihood ratio test. (Strictly, the null value \(\theta = \infty\) sits on the boundary of the parameter space, so the usual \(\chi^2_1\) p-value is conservative; with evidence this strong the conclusion is unchanged.)
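Since AIC may be new, here is a small sketch computing it by hand for the NB model; the result should match the AIC() value reported below.

# AIC = -2 * log-likelihood + 2 * (number of estimated parameters)
k_nb <- length(coef(nb_model)) + 1            # +1 for the dispersion parameter theta
-2 * as.numeric(logLik(nb_model)) + 2 * k_nb  # should equal AIC(nb_model) = 6785.8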

# Calculate AIC and log-likelihood
model_comparison <- tibble(
  Model = c("Poisson", "Negative Binomial"),
  `Log-Likelihood` = c(logLik(poisson_model), logLik(nb_model)),
  AIC = c(AIC(poisson_model), AIC(nb_model)),
  `Num Parameters` = c(length(coef(poisson_model)),
                       length(coef(nb_model)) + 1),  # +1 for theta
  `Deviance` = c(poisson_model$deviance, nb_model$deviance)
)

model_comparison %>%
  mutate(across(where(is.numeric), ~round(., 2))) %>%
  kable()
# Likelihood ratio test
lr_test <- 2 * (logLik(nb_model) - logLik(poisson_model))
lr_pval <- pchisq(lr_test, df = 1, lower.tail = FALSE)

cat("\n=== Likelihood Ratio Test ===\n")

=== Likelihood Ratio Test ===
cat("H0: Poisson is adequate (theta = infinity)\n")
H0: Poisson is adequate (theta = infinity)
cat("H1: Negative Binomial is needed (theta < infinity)\n\n")
H1: Negative Binomial is needed (theta < infinity)
cat("LR statistic:", lr_test, "\n")
LR statistic: 526.306 
cat("p-value:", format.pval(lr_pval), "\n\n")
p-value: < 2.22e-16 
Table 5: Model comparison: AIC and Log-Likelihood
Model              Log-Likelihood      AIC  Num Parameters  Deviance
Poisson                  -3648.06  7310.12               7   4964.56
Negative Binomial        -3384.91  6785.82               8   3008.95
if (lr_pval < 0.001) {
  cat("CONCLUSION: Negative Binomial is significantly better (p < 0.001)\n")
} else if (lr_pval < 0.05) {
  cat("CONCLUSION: Negative Binomial is significantly better (p < 0.05)\n")
} else {
  cat("CONCLUSION: No significant difference between models\n")
}
CONCLUSION: Negative Binomial is significantly better (p < 0.001)

Lower AIC indicates better fit: Negative Binomial has AIC = 6785.8 vs Poisson AIC = 7310.1.

Predicted vs Observed

health_data <- health_data %>%
  mutate(
    pred_poisson = predict(poisson_model, type = "response"),
    pred_nb = predict(nb_model, type = "response")
  )

# Count frequencies
observed_freq <- health_data %>%
  count(visits) %>%
  rename(observed = n) %>%
  filter(visits <= 20)  # Limit for visualization

# Compare distributions
p1 <- ggplot() +
  geom_col(data = observed_freq, aes(x = visits, y = observed),
           fill = "gray70", alpha = 0.7, width = 0.8) +
  geom_line(data = tibble(visits = 0:20,
                          predicted = sapply(0:20, function(k)
                            sum(dpois(k, lambda = health_data$pred_poisson)))),
            aes(x = visits, y = predicted),
            color = "steelblue", linewidth = 1.5) +
  labs(title = "Poisson Model",
       x = "Number of Office Visits",
       y = "Frequency") +
  theme_minimal()

p2 <- ggplot() +
  geom_col(data = observed_freq, aes(x = visits, y = observed),
           fill = "gray70", alpha = 0.7, width = 0.8) +
  geom_line(data = tibble(visits = 0:20,
                          predicted = sapply(0:20, function(k)
                            sum(dnbinom(k, size = theta_hat,
                                       mu = health_data$pred_nb)))),
            aes(x = visits, y = predicted),
            color = "firebrick", linewidth = 1.5) +
  labs(title = "Negative Binomial Model",
       x = "Number of Office Visits",
       y = "Frequency") +
  theme_minimal()

p1 + p2
Figure 7: Predicted vs observed counts: NB fits better at higher counts

Marginal Effects

To see what the fitted NB model implies in practice, we hold the other covariates fixed at representative values and plot the predicted number of visits across the health score, separately for people with and without a chronic condition.

marginal_data <- expand.grid(
  health = 0:12,  # Health score range
  chronic_condition = factor(c("No", "Yes"), levels = c("No", "Yes")),
  gender = factor("female", levels = c("male", "female")),
  age_years = 45,
  private_insurance = factor("Yes", levels = c("No", "Yes")),
  illness = 0
)

# Predict
marginal_data <- marginal_data %>%
  mutate(
    predicted = predict(nb_model, newdata = ., type = "response")
  )

ggplot(marginal_data, aes(x = health, y = predicted, color = chronic_condition,
                          fill = chronic_condition)) +
  geom_line(linewidth = 1.5) +
  geom_point(size = 2) +
  scale_color_manual(values = c("No" = "steelblue", "Yes" = "firebrick"),
                     name = "Chronic condition") +
  scale_fill_manual(values = c("No" = "steelblue", "Yes" = "firebrick"),
                     name = "Chronic condition") +
  labs(title = "Predicted office visits",
       subtitle = "For 45-year-old female with private insurance, no recent illnesses",
       x = "Health score (0 = worst, 12 = best)",
       y = "Predicted number of vsisits") +
  theme_minimal(base_size = 12)
Figure 8: Marginal effect of chronic conditions on office visits

References

  1. Cameron, A. C., & Trivedi, P. K. (2013). Regression Analysis of Count Data (2nd ed.). Cambridge University Press.

  2. Hilbe, J. M. (2011). Negative Binomial Regression (2nd ed.). Cambridge University Press.

Session Info

R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

time zone: Asia/Kolkata
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gridExtra_2.3   knitr_1.50      broom_1.0.10    patchwork_1.3.2
 [5] ggpubr_0.6.2    AER_1.2-15      survival_3.8-3  sandwich_3.1-1 
 [9] lmtest_0.9-40   zoo_1.8-14      car_3.1-3       carData_3.0-5  
[13] pscl_1.5.9      MASS_7.3-65     lubridate_1.9.4 forcats_1.0.1  
[17] stringr_1.6.0   dplyr_1.1.4     purrr_1.2.0     readr_2.1.5    
[21] tidyr_1.3.1     tibble_3.3.0    ggplot2_4.0.0   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] generics_0.1.4     rstatix_0.7.3      stringi_1.8.7      lattice_0.22-7    
 [5] hms_1.1.4          digest_0.6.37      magrittr_2.0.4     evaluate_1.0.5    
 [9] grid_4.5.2         timechange_0.3.0   RColorBrewer_1.1-3 fastmap_1.2.0     
[13] Matrix_1.7-4       jsonlite_2.0.0     backports_1.5.0    Formula_1.2-5     
[17] mgcv_1.9-4         scales_1.4.0       abind_1.4-8        cli_3.6.5         
[21] rlang_1.1.6        splines_4.5.2      withr_3.0.2        yaml_2.3.10       
[25] tools_4.5.2        tzdb_0.5.0         ggsignif_0.6.4     vctrs_0.6.5       
[29] R6_2.6.1           lifecycle_1.0.4    htmlwidgets_1.6.4  pkgconfig_2.0.3   
[33] pillar_1.11.1      gtable_0.3.6       glue_1.8.0         xfun_0.54         
[37] tidyselect_1.2.1   dichromat_2.0-0.1  farver_2.1.2       nlme_3.1-168      
[41] htmltools_0.5.8.1  labeling_0.4.3     rmarkdown_2.30     compiler_4.5.2    
[45] S7_0.2.0