Lecture 02: Goodness of Fit Tests

Performing goodness of fit tests

list.of.packages <- c("tidyverse", "ggpubr")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggpubr)
theme_set(theme_pubr())

Read the data

This dataset was collected in class and records the number of votes for each birth month.

df <- read_csv("https://gist.github.com/saketkc/622dc866f91c73e8b90540827f0f93ad/raw")
Rows: 12 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Choices
dbl (1): Votes

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Make sure the months remain ordered
df$Choices <- factor(df$Choices, levels=df$Choices)

ggplot(df, aes(Choices, Votes)) +
    geom_col() +
    xlab("Birth Month") +
    ylab("Frequency")

We can create an expectation column by dividing the total number of votes by the number of months.

df$expectation <- sum(df$Votes) / length(df$Choices)
df
# A tibble: 12 × 3
   Choices Votes expectation
   <fct>   <dbl>       <dbl>
 1 Jan         9        5.08
 2 Feb         6        5.08
 3 Mar         7        5.08
 4 Apr         4        5.08
 5 May         3        5.08
 6 Jun         3        5.08
 7 Jul         5        5.08
 8 Aug         4        5.08
 9 Sep         5        5.08
10 Oct         9        5.08
11 Nov         3        5.08
12 Dec         3        5.08
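
The same expected counts can be written as the total count times a hypothesised probability, which also covers unequal probabilities. A minimal sketch, assuming the vote counts from the table above are typed in by hand (the names `votes`, `p_null` and `expected_counts` are illustrative and not part of the lecture code):

```r
# Vote counts copied from the table above (Jan..Dec)
votes <- c(9, 6, 7, 4, 3, 3, 5, 4, 5, 9, 3, 3)

# Hypothesised probabilities: every month equally likely
p_null <- rep(1 / length(votes), length(votes))

# Expected count per month = total votes * hypothesised probability
expected_counts <- sum(votes) * p_null

# Sanity check: expected counts add up to the observed total
stopifnot(isTRUE(all.equal(sum(expected_counts), sum(votes))))

# Rule of thumb often quoted for the chi-square approximation:
# every expected count should be at least 5
all(expected_counts >= 5)
```

Writing the expectation as `sum(votes) * p_null` keeps the code unchanged if the hypothesised probabilities are later replaced, for example by probabilities proportional to the number of days in each month.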

Goodness of fit test

Null hypothesis: births are equally likely in every month, so each month has probability 1/12 (about 8.33%).
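
In symbols, with $O_i$ the observed count for month $i$, $E_i = n/12$ the expected count under the null hypothesis, and $k = 12$ categories, the two statistics computed in this section are usually written as follows (the notation is ours, not taken from the lecture code):

```latex
X^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},
\qquad
G = 2 \sum_{i=1}^{k} O_i \ln\!\left(\frac{O_i}{E_i}\right)
```

Under the null hypothesis both are approximately chi-square distributed with $k - 1 = 11$ degrees of freedom, and large values are evidence against the null.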

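observed <- df$Votes
expected <- df$expectation
# Pearson chi-square statistic: sum over months of (O - E)^2 / E
chi_square_stat <- sum((observed - expected)^2 / expected)
dof <- length(observed) - 1  # 12 categories, no parameters estimated from the data
# Upper-tail probability: chance of a statistic at least this large under the null
p_value <- pchisq(chi_square_stat, dof, lower.tail = FALSE)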
alpha <- 0.05  # Significance level
if (p_value < alpha) {
  cat("Reject the null hypothesis")
} else {
  cat("Fail to reject the null hypothesis")
}
Fail to reject the null hypothesis
chi_square_stat
[1] 10.80328
p_value
[1] 0.4598882
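
An equivalent way to reach the same decision is to compare the statistic with the critical value of the chi-square distribution instead of comparing the p-value with alpha. A minimal sketch, assuming the vote counts are re-entered by hand so that the block runs on its own (`critical_value` and `reject` are illustrative names):

```r
# Vote counts copied from the table above (Jan..Dec)
votes <- c(9, 6, 7, 4, 3, 3, 5, 4, 5, 9, 3, 3)
expected <- rep(sum(votes) / length(votes), length(votes))

chi_square_stat <- sum((votes - expected)^2 / expected)
dof <- length(votes) - 1
alpha <- 0.05

# Critical value: the (1 - alpha) quantile of the chi-square distribution
critical_value <- qchisq(1 - alpha, df = dof)

# Reject only if the statistic exceeds the critical value
reject <- chi_square_stat > critical_value
reject
```

The two decision rules always agree, because the p-value falls below alpha exactly when the statistic exceeds the critical value.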

Using the G-statistic (likelihood-ratio test):

G_stat <- 2 * sum(observed * log(observed / expected), na.rm = TRUE)
dof <- length(observed) - 1
# Upper tail, as for the chi-square statistic above
p_value <- pchisq(G_stat, df = dof, lower.tail = FALSE)
alpha <- 0.05  # Significance level
if (p_value < alpha) {
  cat("Reject the null hypothesis")
} else {
  cat("Fail to reject the null hypothesis")
}
Fail to reject the null hypothesis
p_value
[1] 0.5114154
G_stat
[1] 10.2121
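
The G-test can be wrapped in a small helper. The sketch below is one possible packaging, not a function from base R: `g_test` is a hypothetical name, it uses the upper tail of the chi-square distribution, and it skips categories with zero observed count explicitly (their contribution to G is taken as 0) rather than relying on `na.rm = TRUE`.

```r
# Hypothetical helper: G-test of goodness of fit for given expected counts
g_test <- function(observed, expected) {
  stopifnot(length(observed) == length(expected), all(expected > 0))

  # Terms with observed == 0 contribute 0 to G, so drop them before taking logs
  nonzero <- observed > 0
  g_stat <- 2 * sum(observed[nonzero] * log(observed[nonzero] / expected[nonzero]))

  dof <- length(observed) - 1
  # Upper-tail probability under the chi-square approximation
  p_value <- pchisq(g_stat, df = dof, lower.tail = FALSE)

  list(statistic = g_stat, df = dof, p.value = p_value)
}

# Vote counts copied from the table above (Jan..Dec)
votes <- c(9, 6, 7, 4, 3, 3, 5, 4, 5, 9, 3, 3)
result <- g_test(votes, rep(sum(votes) / length(votes), length(votes)))
result$statistic
result$p.value
```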

R's built-in chisq.test() performs the same chi-square test directly and reproduces the statistic and p-value computed by hand above:

chisq.test(df$Votes, p = rep(1/12, 12), rescale.p = TRUE)

    Chi-squared test for given probabilities

data:  df$Votes
X-squared = 10.803, df = 11, p-value = 0.4599
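
The object returned by `chisq.test()` also stores the expected counts and Pearson residuals, and the function can estimate the p-value by Monte Carlo simulation, which is sometimes preferred when expected counts are small. A minimal sketch, assuming the vote counts are re-entered by hand; the seed and the number of replicates `B = 10000` are arbitrary choices:

```r
# Vote counts copied from the table above, labelled with month abbreviations
votes <- c(9, 6, 7, 4, 3, 3, 5, 4, 5, 9, 3, 3)
names(votes) <- month.abb

res <- chisq.test(votes, p = rep(1 / 12, 12))

# Expected counts under the null and Pearson residuals (O - E) / sqrt(E)
res$expected
round(res$residuals, 2)

# Monte Carlo p-value instead of the chi-square approximation
set.seed(1)  # arbitrary seed, only for reproducibility
res_sim <- chisq.test(votes, p = rep(1 / 12, 12),
                      simulate.p.value = TRUE, B = 10000)
res_sim$p.value
```

Residuals far from zero point to the months that contribute most to the statistic. The simulated p-value varies slightly with the seed and with `B`.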