Multiple hypothesis testing: A primer

The problem of multiple testing.

Imagine you’re developing a diagnostic panel to detect cancer. You’ve identified 500 genes that might be associated with cancer risk. Your task: determine which genes are truly associated with cancer vs. which associations are just due to chance.

To do this, you:

  1. Measure expression levels of all 500 genes in cancer patients vs. healthy controls
  2. Perform a statistical test (like a t-test) for each gene to see if expression differs between groups
  3. Flag every gene with p-value < 0.05 as “associated with cancer”

When you perform one statistical test with α = 0.05, you accept a 5% chance of a false positive (Type I error).

When you perform 500 independent tests, all with α = 0.05:

\[\text{Probability of at least one false positive} = 1 - (1 - 0.05)^{500}\]

Let’s calculate this:

\[\text{Probability} = 1 - (0.95)^{500} \approx 1 - 7.3 \times 10^{-12} \approx 100\%\]

You are almost guaranteed to find false positives!
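A quick sanity check in R, plugging the same numbers into the formula above:

alpha <- 0.05
m <- 500
1 - (1 - alpha)^m # about 1 - 7.3e-12, i.e. essentially certain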

Why does this happen?

If you flip a fair coin 5 times, the probability of getting all heads is low: (0.5)^5 ≈ 3.1%. But if you flip 500 coins once each, about 250 will come up heads just by chance.

In our gene panel:

  • We have 500 “coins” (genes)
  • We flip each once (perform one test per gene)
  • At α = 0.05, we expect 500 × 0.05 = 25 false positives just from random chance

Similarly, suppose we test 500 genes and find 30 with p-value < 0.05. In expectation, ~25 of those are false positives, and only ~5 might be truly associated with cancer.

So publishing all 30 genes would mean:

  • An ~83% false discovery rate (25 false positives out of 30)
  • Patients tested on these genes get misleading results
  • Everyone wastes time studying false leads
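The back-of-envelope arithmetic behind these numbers, in R:

m <- 500      # genes tested
alpha <- 0.05
n_hits <- 30  # genes with p < 0.05 in this hypothetical panel

expected_false <- m * alpha # ~25 false positives expected if no gene is truly associated
expected_false / n_hits     # implied false discovery rate: ~0.83, i.e. ~83%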

This is the multiple testing problem.

library(ggplot2)
library(dplyr)

library(tidyr)

# Set seed for reproducibility
set.seed(42)

mytheme <- theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 11, color = "gray40"),
    axis.title = element_text(size = 11),
    legend.position = "top"
  )

Visualising the problem

Let’s simulate what happens when we test 100 hypotheses where none are actually true (all null hypotheses are true):

n_tests <- 100
p_values <- runif(n_tests, 0, 1) # Under null, p-values are uniformly distributed

# Count "significant" results at alpha = 0.05
n_significant <- sum(p_values < 0.05)

# Create histogram
hist_data <- data.frame(p_value = p_values)

ggplot(hist_data, aes(x = p_value)) +
  geom_histogram(bins = 20, fill = "steelblue", alpha = 0.7, color = "black") +
  geom_vline(xintercept = 0.05, color = "red", linetype = "dashed", linewidth = 1) +
  annotate("text",
    x = 0.05, y = Inf,
    label = paste0(n_significant, " 'significant' \nfalse positives"),
    hjust = -0.1, vjust = 1.5, color = "red", size = 4
  ) +
  labs(
    title = "P-values from 100 Tests Where Null is True",
    subtitle = "Red line shows α = 0.05 threshold",
    x = "P-value",
    y = "Count"
  ) +
  mytheme

We found 8 “significant” results even though nothing was real! This is the problem we need to solve.

But again, what is the problem?

When we do multiple tests, we need to think about different types of errors:

error_matrix <- data.frame(
  Reality = c(
    "Null True (No Effect)", "Null True (No Effect)",
    "Null False (Real Effect)", "Null False (Real Effect)"
  ),
  Decision = c("Not Reject", "Reject", "Not Reject", "Reject"),
  Outcome = c(
    "Correct ✓", "FALSE POSITIVE ✗\n(Type I Error)",
    "FALSE NEGATIVE ✗\n(Type II Error)", "Correct ✓"
  ),
  Color = c("good", "bad", "bad", "good")
)

knitr::kable(error_matrix[, 1:3],
  col.names = c("Reality", "Your Decision", "Outcome"),
  caption = "Possible Outcomes in Hypothesis Testing"
)
Possible Outcomes in Hypothesis Testing

  Reality                    Your Decision   Outcome
  Null True (No Effect)      Not Reject      Correct ✓
  Null True (No Effect)      Reject          FALSE POSITIVE ✗ (Type I Error)
  Null False (Real Effect)   Not Reject      FALSE NEGATIVE ✗ (Type II Error)
  Null False (Real Effect)   Reject          Correct ✓

The two different error rates

Family-Wise Error Rate (FWER): The probability of making even one false positive across all tests. Like saying “I want zero mistakes” → very conservative: it protects you strongly but may miss real discoveries. The Bonferroni correction controls FWER.

False Discovery Rate (FDR): Among all the discoveries you make, what proportion are false? Like saying “I’m okay if 5% of my findings are false positives” → more flexible: it lets you find more real effects while still controlling the proportion of false discoveries. The Benjamini-Hochberg procedure controls FDR.
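Both corrections are available in base R through p.adjust(). A minimal illustration on a handful of made-up p-values:

# Five hypothetical p-values (illustrative only)
p <- c(0.001, 0.012, 0.030, 0.045, 0.300)

p.adjust(p, method = "bonferroni") <= 0.05 # FWER control: only the strongest results survive
p.adjust(p, method = "BH") <= 0.05         # FDR control: more lenient, more discoveries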

An intuitive analogy

Imagine you’re running a cancer diagnostic clinic and you’ve developed a panel of 500 genetic markers. You use it to screen patients:

Without multiple testing correction

“We’ll flag anyone with abnormal markers as high-risk”

  • You flag 30 patients as having cancer risk
  • But 25 of them are actually fine (false alarms)
  • Result: 83% of your “positive” diagnoses are wrong
  • Patients get unnecessary anxiety, costly treatments, wrong decisions

FWER - Bonferroni (Ultra-Conservative)

“We want to be 95% certain we don’t falsely flag ANY patient”

  • You only flag patients with the clearest, most obvious markers
  • You correctly identify 3 patients with real cancer risk
  • BUT you miss 27 patients who actually have cancer risk (false negatives)
  • Result: Everyone you flag is definitely high-risk, but you let many sick patients walk out thinking they’re fine

FDR - Benjamini-Hochberg (balanced)

“We’re okay if 5% of our high-risk diagnoses turn out to be false”

  • You flag 8 patients as high-risk
  • You expect about 0.4 of them to ultimately be false alarms (5% of 8)
  • Result: You catch more real cases (8 vs 3), but accept that roughly 1 in 20 “high-risk” calls might be wrong
  • These 8 patients get further testing (some are confirmed, a few aren’t), but no major harm is done

Summary

Bonferroni = insisting that not even one false alarm is acceptable, so you only act on the most obvious cases

FDR = accepting that a small fraction of your calls (about 5%) will be wrong, but since you’re following up with confirmatory testing anyway, that’s fine: you catch more real cases

In a clinic, you’d never use Bonferroni alone because you’d miss too much cancer. You’d use FDR for the initial panel, then confirm with follow-up tests.

The Benjamini-Hochberg procedure

The Benjamini-Hochberg (BH) method is surprisingly simple and elegant. Here’s how it works:

The algorithm

  1. Sort all p-values from smallest to largest
  2. Number them: i = 1, 2, 3, …, m (where m is total number of tests)
  3. Calculate threshold for each test: (i/m) × α (where α is your desired FDR, usually 0.05)
  4. Find the largest i where: p-value ≤ (i/m) × α
  5. Reject all hypotheses from 1 to i
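Here is a minimal sketch of that rule as a standalone R function (base R’s p.adjust(p, method = "BH") implements the same procedure via adjusted p-values):

bh_reject <- function(p, alpha = 0.05) {
  m <- length(p)
  ord <- order(p)                         # step 1: sort from smallest to largest
  thresholds <- (seq_len(m) / m) * alpha  # step 3: (i/m) * alpha for i = 1..m
  below <- which(p[ord] <= thresholds)    # step 4: ranks whose p-value is under the line
  k <- if (length(below) > 0) max(below) else 0
  reject <- rep(FALSE, m)
  reject[ord[seq_len(k)]] <- TRUE         # step 5: reject hypotheses ranked 1..k
  reject
}

# Toy example: compare against base R's implementation
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216)
bh_reject(p)
p.adjust(p, method = "BH") <= 0.05 # same decisions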

Let’s see this using a demo:

# Simulate a more realistic scenario: 100 tests, 20 are truly significant
set.seed(123)
n_tests <- 100
n_true <- 20
n_null <- n_tests - n_true

# Generate p-values
# True alternatives: small p-values (from a beta distribution skewed toward 0)
p_true <- rbeta(n_true, 0.5, 10)
# True nulls: uniform p-values
p_null <- runif(n_null, 0, 1)

# Combine and create dataset
p_values <- c(p_true, p_null)
true_status <- c(rep("True Effect", n_true), rep("No Effect", n_null))

test_data <- data.frame(
  test_id = 1:n_tests,
  p_value = p_values,
  truth = true_status
) %>%
  arrange(p_value) %>%
  mutate(
    rank = row_number(),
    bh_threshold = (rank / n_tests) * 0.05,
    bonf_threshold = 0.05 / n_tests,
    # BH is a step-up rule: find the largest rank whose p-value is below its
    # threshold, then reject everything up to (and including) that rank
    significant_bh = rank <= max(which(p_value <= bh_threshold), 0L),
    significant_bonf = p_value <= bonf_threshold
  )

# Find the BH cutoff
bh_cutoff <- max(which(test_data$significant_bh))

# Create the visualization
ggplot(test_data, aes(x = rank, y = p_value)) +
  # BH threshold line
  geom_line(aes(y = bh_threshold), color = "blue", linewidth = 1, linetype = "solid") +
  # Bonferroni threshold line
  geom_hline(yintercept = 0.05 / n_tests, color = "orange", linewidth = 1, linetype = "dashed") +
  # Alpha line
  geom_hline(yintercept = 0.05, color = "gray", linewidth = 0.5, linetype = "dotted") +
  # Points colored by truth
  geom_point(aes(color = truth, shape = truth), size = 3, alpha = 0.7) +
  # Highlight the BH cutoff
  geom_vline(xintercept = bh_cutoff, color = "blue", linetype = "dotted", alpha = 0.5) +
  # Annotations
  annotate("text",
    x = 80, y = 0.045, label = "BH Threshold\n(i/m) × α",
    color = "blue", size = 4, fontface = "bold"
  ) +
  annotate("text",
    x = 80, y = 0.0005, label = "Bonferroni\nα/m",
    color = "orange", size = 4, fontface = "bold"
  ) +
  annotate("text",
    x = bh_cutoff, y = 0.8,
    label = paste0("BH rejects ", bh_cutoff, " tests"),
    color = "blue", hjust = -0.1, size = 3.5
  ) +
  scale_color_manual(values = c("True Effect" = "#E74C3C", "No Effect" = "#95A5A6")) +
  scale_shape_manual(values = c("True Effect" = 16, "No Effect" = 1)) +
  labs(
    title = "Benjamini-Hochberg Procedure Visualization",
    subtitle = "Blue line shows BH threshold increasing with rank",
    x = "Rank (i) - sorted from smallest to largest p-value",
    y = "P-value",
    color = "True Status",
    shape = "True Status"
  ) +
  mytheme +
  theme(legend.position = "right")

What’s happening here?

The blue line is the BH threshold: it starts small and gradually increases. This is the key insight!

  • Small p-values (left side): easy to reject → the threshold is low, but so are the p-values
  • Larger p-values (right side): the threshold increases, allowing more discoveries
  • The magic: BH finds the largest rank where the p-value still falls below the line

results_summary <- test_data %>%
  group_by(truth) %>%
  summarise(
    Total = n(),
    `BH Significant` = sum(significant_bh),
    `Bonf Significant` = sum(significant_bonf)
  )

knitr::kable(results_summary,
  caption = "Comparison: True effects vs. Detected effects"
)
Comparison: True effects vs. Detected effects

  truth         Total   BH Significant   Bonf Significant
  No Effect        80                1                  0
  True Effect      20                4                  2

Notice how:

  • Benjamini-Hochberg found 5 significant tests
  • Bonferroni only found 2 significant tests
  • The truth: 20 tests actually have real effects

BH gives us more power to detect real effects while still controlling false discoveries!
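As a sanity check, the same decisions can be reproduced with base R’s built-in adjustment, which applies the identical step-up rule:

# Adjusted p-values <= 0.05 should match the significant_bh column computed above
sum(p.adjust(test_data$p_value, method = "BH") <= 0.05)
sum(test_data$significant_bh)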

Why does this work?

The genius of BH is in the increasing threshold. Here’s why it works:

The ranking creates protection

  1. True effects tend to have small p-values → They cluster on the left
  2. False positives are randomly scattered → They’re spread throughout
  3. By sorting p-values, we naturally group most true effects together
  4. The increasing threshold allows us to be more lenient as we move right, where we’re less likely to find true effects anyway

Visual intuition

# Zoom in on the first 30 tests
zoom_data <- test_data %>% filter(rank <= 30)

ggplot(zoom_data, aes(x = rank, y = p_value)) +
  # Step function showing BH decision
  geom_step(aes(y = bh_threshold), color = "blue", linewidth = 1.5, direction = "mid") +
  geom_point(aes(color = truth, shape = significant_bh), size = 4, alpha = 0.8) +
  geom_segment(aes(x = rank, xend = rank, y = 0, yend = p_value, color = truth),
    alpha = 0.3, linetype = "dotted"
  ) +
  scale_color_manual(values = c("True Effect" = "#E74C3C", "No Effect" = "#95A5A6")) +
  scale_shape_manual(
    values = c("TRUE" = 19, "FALSE" = 1),
    guide = guide_legend(title = "BH Rejects?")
  ) +
  labs(
    title = "Close-up: How BH makes decisions",
    subtitle = "Points below the blue line are rejected (significant)",
    x = "Rank (i)",
    y = "P-value",
    color = "True Status"
  ) +
  mytheme +
  theme(legend.position = "right")

Think of it like walking down a staircase of sorted p-values, starting from the largest:

  • You step down the ranks until you hit the first p-value that falls below the rising threshold
  • At that point you stop
  • That test and every test with a smaller p-value are declared significant, even if a few of them sit slightly above the line

The mathematical intuition

You don’t need heavy math to understand why BH works, but a little intuition helps!

Why the (i/m) × α formula?

The BH procedure uses the threshold: p(i) ≤ (i/m) × α

Let’s break down what each part means:

  • i = the rank of this p-value (1st smallest, 2nd smallest, etc.)
  • m = total number of tests
  • α = your target FDR (e.g., 0.05)

The intuition: As you accept more tests (increasing i), you’re allowed a slightly more lenient threshold. Why? Because:

  1. True effects cluster at small p-values - they naturally rank early
  2. The ratio i/m grows slowly - you’re not being too generous
  3. The method “spends” your error budget across all tests efficiently

Example

Suppose you have 10 tests and want FDR = 0.05:

i <- 1:10
m <- 10
alpha <- 0.05
threshold <- (i / m) * alpha

math_example <- data.frame(
  Rank_i = i,
  `i/m` = i / m,
  `Threshold (i/m)×0.05` = threshold,
  `Interpretation` = c(
    "1st test: very strict",
    "2nd test: slight relaxation",
    "3rd test: more lenient",
    "4th test: continuing to grow",
    "5th test: halfway point",
    "6th test: getting lenient",
    "7th test: fairly relaxed",
    "8th test: quite lenient",
    "9th test: very lenient",
    "10th test: same as unadjusted!"
  )
)

knitr::kable(math_example[1:5, ],
  digits = 4,
  col.names = c("Rank (i)", "i/m", "Threshold", "What this means"),
  caption = "BH threshold calculation (first 5 tests)"
)
BH threshold calculation (first 5 tests)

  Rank (i)   i/m   Threshold   What this means
  1          0.1   0.005       1st test: very strict
  2          0.2   0.010       2nd test: slight relaxation
  3          0.3   0.015       3rd test: more lenient
  4          0.4   0.020       4th test: continuing to grow
  5          0.5   0.025       5th test: halfway point

Notice:

  • Rank 1: Threshold = 0.005 (very strict, like Bonferroni: 0.05/10)
  • Rank 5: Threshold = 0.025 (halfway to α)
  • Rank 10: Threshold = 0.050 (same as unadjusted!)

The increasing threshold is the key innovation!

Why does this control FDR?

The mathematical proof is complex, but here’s the intuitive idea:

Setup: Suppose you have m₀ true null hypotheses (no effect) and m₁ true alternatives (real effects).

Under the nulls: P-values are uniformly distributed between 0 and 1.

The clever part: When you sort all p-values together, the true nulls are “diluted” by the true alternatives (which have small p-values). By using the increasing threshold (i/m) × α, the procedure:

  1. Mostly rejects true alternatives early (they have small p-values)
  2. Occasionally accepts a null by mistake (but this is bounded!)
  3. The math shows: On average, ≤ α × (m₀/m) ≤ α fraction of discoveries are false

Key insight: The more true effects you have, the better BH performs (true alternatives “protect” against false positives by filling up the low ranks).
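A small Monte Carlo sketch (same flavour of simulation as above, with numbers chosen arbitrarily) makes the “on average” part concrete:

# Repeat a 100-test experiment many times and average the realized FDR of BH
set.seed(1)
realized_fdr <- replicate(2000, {
  p <- c(rbeta(20, 0.5, 10), runif(80))          # 20 true effects, 80 true nulls
  rejected <- p.adjust(p, method = "BH") <= 0.05
  false_pos <- sum(rejected[21:100])              # rejected nulls are false discoveries
  if (any(rejected)) false_pos / sum(rejected) else 0
})
mean(realized_fdr) # should land at or below 0.05 * (80/100) = 0.04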

Visualising the math

set.seed(444)

# Scenario 1: Few true effects (harder for BH)
n_tests <- 50
n_true_few <- 5
p_true_few <- rbeta(n_true_few, 0.5, 15)
p_null_few <- runif(n_tests - n_true_few, 0, 1)

# Scenario 2: Many true effects (easier for BH)
n_true_many <- 25
p_true_many <- rbeta(n_true_many, 0.5, 15)
p_null_many <- runif(n_tests - n_true_many, 0, 1)

# Create datasets
data_few <- data.frame(
  p_value = c(p_true_few, p_null_few),
  truth = c(rep("True", n_true_few), rep("Null", n_tests - n_true_few)),
  scenario = "10% True Effects"
) %>%
  arrange(p_value) %>%
  mutate(rank = row_number())

data_many <- data.frame(
  p_value = c(p_true_many, p_null_many),
  truth = c(rep("True", n_true_many), rep("Null", n_tests - n_true_many)),
  scenario = "50% True Effects"
) %>%
  arrange(p_value) %>%
  mutate(rank = row_number())

combined_data <- rbind(data_few, data_many) %>%
  mutate(bh_threshold = (rank / n_tests) * 0.05)

ggplot(combined_data, aes(x = rank, y = p_value)) +
  geom_line(aes(y = bh_threshold), color = "blue", linewidth = 1) +
  geom_point(aes(color = truth, shape = truth), size = 2.5, alpha = 0.7) +
  facet_wrap(~scenario, ncol = 2) +
  scale_color_manual(values = c("True" = "#E74C3C", "Null" = "#95A5A6")) +
  scale_shape_manual(values = c("True" = 16, "Null" = 1)) +
  labs(
    title = "More true effects = Better protection against false positives",
    subtitle = "True effects (red) fill the low ranks, pushing nulls (gray) to higher ranks",
    x = "Rank",
    y = "P-value",
    color = "Truth",
    shape = "Truth"
  ) +
  mytheme +
  theme(legend.position = "top")

When there are many true effects (right panel), they dominate the low ranks, making it harder for false positives (nulls) to sneak in. This is why BH becomes more efficient when there are more real effects!

Example: Gene expression analysis

Let’s walk through a realistic example with gene expression data.

# Simulate gene expression study
set.seed(456)
n_genes <- 1000
n_true_de <- 50 # 50 genes truly differentially expressed

# Simulate p-values
p_values_real <- c(
  rbeta(n_true_de, 0.3, 15), # True DE genes: small p-values
  runif(n_genes - n_true_de, 0, 1) # Non-DE genes: uniform
)

gene_data <- data.frame(
  gene_id = paste0("Gene_", 1:n_genes),
  p_value = p_values_real,
  truth = c(rep("DE", n_true_de), rep("Not DE", n_genes - n_true_de))
) %>%
  arrange(p_value) %>%
  mutate(
    rank = row_number(),
    # Apply different methods
    bh_threshold = (rank / n_genes) * 0.05,
    # BH step-up rule: reject everything up to the largest rank under the line
    bh_significant = rank <= max(which(p_value <= bh_threshold), 0L),
    bonf_significant = p_value <= (0.05 / n_genes),
    unadjusted_significant = p_value <= 0.05
  )

# Calculate discoveries and false discoveries for each method
calc_performance <- function(data, sig_col, truth_col = "truth") {
  true_pos <- sum(data[[sig_col]] & data[[truth_col]] == "DE")
  false_pos <- sum(data[[sig_col]] & data[[truth_col]] == "Not DE")
  true_neg <- sum(!data[[sig_col]] & data[[truth_col]] == "Not DE")
  false_neg <- sum(!data[[sig_col]] & data[[truth_col]] == "DE")

  total_discoveries <- true_pos + false_pos
  fdr_actual <- if (total_discoveries > 0) false_pos / total_discoveries else 0

  data.frame(
    True_Positives = true_pos,
    False_Positives = false_pos,
    Total_Discoveries = total_discoveries,
    FDR_Actual = round(fdr_actual, 3),
    Power = round(true_pos / 50, 3) # 50 true DE genes
  )
}

performance <- data.frame(
  Method = c("Unadjusted (α=0.05)", "Bonferroni", "Benjamini-Hochberg"),
  rbind(
    calc_performance(gene_data, "unadjusted_significant"),
    calc_performance(gene_data, "bonf_significant"),
    calc_performance(gene_data, "bh_significant")
  )
)

knitr::kable(performance,
  caption = "Performance Comparison Across Methods",
  digits = 3
)
Performance Comparison Across Methods

  Method                True_Positives   False_Positives   Total_Discoveries   FDR_Actual   Power
  Unadjusted (α=0.05)               45                41                  86        0.477    0.90
  Bonferroni                         5                 0                   5        0.000    0.10
  Benjamini-Hochberg                14                 1                  15        0.067    0.28

What do we learn?

# Reshape for plotting
perf_long <- performance %>%
  select(Method, True_Positives, False_Positives) %>%
  pivot_longer(
    cols = c(True_Positives, False_Positives),
    names_to = "Type", values_to = "Count"
  )

ggplot(perf_long, aes(x = Method, y = Count, fill = Type)) +
  geom_col(position = "stack", alpha = 0.8) +
  geom_text(aes(label = Count),
    position = position_stack(vjust = 0.5),
    color = "white", fontface = "bold", size = 5
  ) +
  scale_fill_manual(
    values = c(
      "True_Positives" = "#27AE60",
      "False_Positives" = "#E74C3C"
    ),
    labels = c("False Positives", "True Positives")
  ) +
  labs(
    title = "Discoveries by method",
    subtitle = "Green = correct discoveries, Red = false positives",
    x = NULL,
    y = "Number of genes called significant (DE)",
    fill = NULL
  ) +
  mytheme +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

  1. Unadjusted: Finds many true positives BUT also many false positives (high FDR)
  2. Bonferroni: Very few false positives BUT misses many true discoveries (low power)
  3. Benjamini-Hochberg: Balances both - good power with controlled FDR ✓

The realized FDR for BH in this particular simulation is 0.067, slightly above the target of 0.05. That is expected: BH guarantees the FDR is at most 0.05 on average over repeated experiments, not in every single realization.

Comparing methods

A comprehensive comparison showing the different “rejection regions”:

comparison_data <- gene_data %>%
  filter(rank <= 200) %>%
  mutate(
    rejection_status = case_when(
      bh_significant & bonf_significant ~ "Both BH & Bonferroni",
      bh_significant & !bonf_significant ~ "BH only",
      !bh_significant & bonf_significant ~ "Bonferroni only",
      TRUE ~ "Neither"
    )
  )

ggplot(comparison_data, aes(x = rank, y = p_value)) +
  geom_line(aes(y = bh_threshold, color = "BH threshold"), linewidth = 1) +
  geom_hline(aes(yintercept = 0.05 / n_genes, color = "Bonferroni threshold"),
    linewidth = 1, linetype = "dashed"
  ) +
  geom_point(aes(shape = truth, fill = rejection_status),
    size = 3, alpha = 0.7, stroke = 1
  ) +
  scale_color_manual(
    name = "Thresholds",
    values = c(
      "BH threshold" = "blue",
      "Bonferroni threshold" = "orange"
    )
  ) +
  scale_shape_manual(values = c("DE" = 21, "Not DE" = 24)) +
  scale_fill_manual(
    name = "Rejection status",
    values = c(
      "Both BH & Bonferroni" = "#27AE60",
      "BH only" = "#3498DB",
      "Bonferroni only" = "#F39C12",
      "Neither" = "white"
    )
  ) +
  labs(
    title = "Compare methods",
    subtitle = "First 200 genes sorted by p-value",
    x = "Rank",
    y = "p-value"
  ) +
  mytheme +
  theme(legend.position = "right") +
  guides(shape = guide_legend(title = "Truth", override.aes = list(fill = "gray")))