Distribution of scheduled tribes across districts (Census 2011)

Tribe Name	Population ↓	% of District	% of National

Browse all tribes

Tribe name	Districts	Total population ↓	States	Action

Mother tongue diversity across districts (Census 2011)

Language	Population ↓	% of District

Understanding mother tongue diversity

What is the Primary level?

Primary (Main Languages): Top-level language categories such as HINDI, BENGALI, TAMIL, TELUGU, MARATHI, etc.

Shannon Diversity Index (H)

Formula: H = -Σ(p_i × ln(p_i)), where p_i is the share of the district's population whose mother tongue is language i

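H = 0: Everyone speaks the same language
H = 0-1: Very low diversity (roughly 1-3 effective languages)
H = 1-2: Moderate diversity (roughly 3-7 effective languages)
H > 2: High diversity (7+ effective languages)
Effective languages = e^H, the number of equally common languages that would give the same H.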
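
As a minimal sketch of the index above, the following computes H and the corresponding effective number of languages from speaker counts. The function names are hypothetical, and the sketch assumes that "effective languages" means e^H, which the source implies but does not state.

```python
import math

def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over languages with speakers."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    h = 0.0
    for c in counts:
        if c > 0:  # a language with no speakers contributes nothing
            p = c / total
            h -= p * math.log(p)
    return h

def effective_languages(counts):
    """Assumed definition: e^H, the equally-common-language equivalent."""
    return math.exp(shannon_index(counts))

# Four equally common languages: H = ln(4), so the effective count is 4.
even = [250, 250, 250, 250]
# One dominant language plus three small ones: far fewer effective languages.
skewed = [970, 10, 10, 10]
print(shannon_index(even), effective_languages(even))
print(shannon_index(skewed), effective_languages(skewed))
```

A district dominated by one language scores close to 1 effective language even when several languages are present, which is why the tables report both effective and total languages.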

Mother tongue diversity (detailed) across districts (Census 2011)

Language	Population ↓	% of District

Understanding mother tongue diversity (detailed)

What is the Detailed level?

Detailed (Sub-languages): Dialects and variants within primary language categories.

Browse mother tongue (main languages)

Language name	Districts	Total speakers ↓	States	Action

Browse mother tongue (detailed sub-languages)

Sub-language name	Main language (L1)	Districts	Total speakers ↓	States	Action

Language diversity by district

District	State	Effective languages (L1) ↓	Total languages (L1)	Effective languages (L2)	Total languages (L2)	Top languages (L1)

Scheduled Tribe diversity by district

District	State	Effective tribes ↓	Total tribes	ST population	ST % of district	Top tribes

Isolated Tribes Distribution (Census 2011)

Tribes in ≤3 districts OR population <1,000
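
The rule above can be sketched as a simple filter. The record type, field names, and example rows are hypothetical, and the sketch assumes "population" refers to a tribe's total population across all districts.

```python
from dataclasses import dataclass

@dataclass
class TribeSummary:
    name: str               # hypothetical field names, not from the source
    n_districts: int        # districts in which the tribe is recorded
    total_population: int   # assumed: total across all districts

def is_isolated(tribe, max_districts=3, min_population=1000):
    """True if the tribe is in <= 3 districts OR has population < 1,000."""
    return (tribe.n_districts <= max_districts
            or tribe.total_population < min_population)

examples = [
    TribeSummary("Tribe A", n_districts=2, total_population=50_000),   # few districts
    TribeSummary("Tribe B", n_districts=12, total_population=800),     # small population
    TribeSummary("Tribe C", n_districts=40, total_population=900_000), # neither
]
isolated = [t.name for t in examples if is_isolated(t)]
print(isolated)
```

Because the two conditions are joined by OR, a large tribe confined to a few districts and a tiny tribe scattered across many districts both qualify.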

Isolated Tribes Directory

Tribe Name	District	State	% of District ↓	District Pop.	Total Pop.

Select a state to view ST distribution

Legend: ST population %, from 0% to 100%

Tribe Name	Population ↓	% of District	% of National

Select a state to view language diversity

Legend: effective languages (L1), from 1 to 10

L1 - Mother Tongue

Language	Speakers ↓	% of District

L2 - Mother Tongue (Detailed)

Language	Speakers ↓	% of District

Genomic sampling allocation

The problem

India has ~1.4 billion people distributed across 4,600+ endogamous groups (jatis and tribes) stratified by caste, language, and geography. We want to sequence 1 million genomes. The central question: how do we allocate these samples to maximise the genetic diversity captured?

This is not a standard epidemiological survey. In epidemiology, you want estimates that are representative of the population mean. In population genomics, you want to capture as much genetic variation as possible, which means oversampling rare, isolated, genetically distinct groups relative to their population share.

Why language and caste, not geography?

The single most important finding comes from Eaaswarkhanth et al. (2021): language family combined with caste/tribe status predicts genetic structure (r² = 0.93) far better than geography alone (r² = 0.17). This is the opposite of the pattern in Europe, where genetic structure largely follows geography. The reason is endogamy: India's groups have practised strict within-group marriage for 70+ generations (~1,575 years; Basu et al. 2016), creating thousands of genetically isolated pockets that coexist in the same geographic space.

Basu et al. (2016) identified five distinct ancestral components: ANI (Ancestral North Indian), ASI (Ancestral South Indian), AAA (Ancestral Austro-Asiatic), ATB (Ancestral Tibeto-Burman), and AND (Andamanese). Most Indians are ANI+ASI admixtures, but tribal groups can be near-pure representatives of specific components. The 1000 Genomes Project's five Indian populations primarily capture ANI ancestry, leaving ASI, AAA, and ATB largely unrepresented (Karunakarane et al. 2016).

Within-group saturation: why ~150 samples per group suffices

Nakatsuka et al. (2017) showed that 81 out of 263 South Asian groups experienced founder events stronger than those of both Ashkenazi Jews and Finns. Within an endogamous group, genetic drift eliminates some rare variants and pushes the remaining ones to higher frequency. General theory (Wendl & Wilson 2009) indicates that ~150 samples capture 80% of variants at ≥0.1% frequency. In bottlenecked populations, variants that would otherwise sit at 0.01% are pushed to 0.1–1%, so even fewer samples suffice.

The implication: after ~150 samples from an endogamous group, you have captured most of the common and founder-amplified variation. The 10,001st sample from the same jati gives you almost nothing new. But the first sample from an unsampled jati gives you ~30% novel variants (PMC6760067). Breadth across groups matters far more than depth within groups.
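
To illustrate why founder-amplified variants are easier to catch, here is a deliberately simplified sketch: the chance that a variant at frequency f appears at least once among the 2n chromosomes of n diploid samples, 1 - (1 - f)^(2n). This is an assumption made for illustration, not the Wendl & Wilson (2009) model, so its numbers need not match the 80% figure quoted above. The function name is hypothetical.

```python
def detection_probability(freq, n_samples, ploidy=2):
    """P(variant seen at least once) under independent draws of chromosomes.

    Simplified model (an assumption): each of ploidy * n_samples chromosomes
    carries the variant independently with probability `freq`.
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError("freq must be between 0 and 1")
    chromosomes = ploidy * n_samples
    return 1.0 - (1.0 - freq) ** chromosomes

# Compare a variant before and after drift pushes it to higher frequency.
for freq in (0.0001, 0.001, 0.01):
    p = detection_probability(freq, n_samples=150)
    print(f"frequency {freq:.4f}: detection probability {p:.3f}")
```

Under this model the detection probability rises steeply with frequency at a fixed sample size, which is the mechanism behind the argument that bottlenecked groups need fewer samples.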

The allocation design

The genetic unit is the endogamous group (jati/tribe), not the district. Two jatis in the same district can be as genetically different as Europeans and East Asians. Districts remain the logistic unit (where to set up collection centres), but allocation is driven by the number of distinct endogamous groups reachable from each district.

We estimate groups per district using two census proxies, plus a floor:

Effective languages (from Shannon entropy of mother tongue data): language family together with caste/tribe status predicts genetic structure at r² = 0.93 (Eaaswarkhanth et al. 2021), so linguistic diversity serves as a proxy for genetic diversity. Each effective language plausibly contains ≥2 endogamous strata (upper caste + lower caste at minimum).
Number of distinct tribes (from Census 2011 C-16 data): tribal groups are the most genetically distinct populations.
Floor of 3: every district has at minimum SC + ST + General categories.

est_groups = max(3, round(effective_languages × 2) + n_tribes)
n_samples = est_groups × 150, rescaled to sum to 1,000,000
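
The two formulas above can be sketched as follows. The function names and the example districts are hypothetical. Python's built-in round is used as a stand-in for the rounding step, which the source does not specify (Python rounds exact halves to the nearest even integer).

```python
def estimate_groups(effective_languages, n_tribes):
    """est_groups = max(3, round(effective_languages * 2) + n_tribes)."""
    return max(3, round(effective_languages * 2) + n_tribes)

def allocate(districts, total=1_000_000, per_group=150):
    """Give each district est_groups * per_group samples, rescaled to `total`.

    `districts` maps name -> (effective_languages, n_tribes).
    Per-district rounding means the result may differ from `total` by a few.
    """
    raw = {name: estimate_groups(el, nt) * per_group
           for name, (el, nt) in districts.items()}
    scale = total / sum(raw.values())
    return {name: round(n * scale) for name, n in raw.items()}

# Hypothetical districts, with a small total so the arithmetic is easy to follow.
example = {
    "District A": (1.2, 0),   # round(2.4) + 0 = 2, raised to the floor of 3
    "District B": (3.6, 12),  # round(7.2) + 12 = 19
    "District C": (2.0, 1),   # round(4.0) + 1 = 5
}
allocation = allocate(example, total=27_000)
print(allocation)
```

Because every district is multiplied by the same 150 before rescaling, that factor cancels out: the final allocation is simply proportional to est_groups.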

Within-district allocation

Within each district, samples are split across SC / ST / General categories with equal per-group targets rather than proportional-to-population. ST groups get samples proportional to the number of distinct tribes present; SC and General groups share the remainder proportional to estimated sub-groups.
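
One possible reading of this split is sketched below. It assumes the ST share equals the number of tribes divided by all estimated groups in the district, and that the SC and General sub-group counts are supplied as inputs. The source does not say how those sub-group counts are estimated, so they are hypothetical parameters here, as is the function name.

```python
def split_district(n_samples, n_tribes, sc_subgroups, general_subgroups):
    """Split a district's samples across ST / SC / General by group counts."""
    total_groups = n_tribes + sc_subgroups + general_subgroups
    if total_groups <= 0:
        raise ValueError("district must contain at least one group")
    # ST share is proportional to the number of distinct tribes (assumed reading).
    st = round(n_samples * n_tribes / total_groups)
    remainder = n_samples - st
    non_st = sc_subgroups + general_subgroups
    # SC and General share the remainder in proportion to their sub-groups.
    sc = round(remainder * sc_subgroups / non_st) if non_st else 0
    general = remainder - sc
    return {"ST": st, "SC": sc, "General": general}

# Hypothetical district: 12 tribes, 3 SC sub-groups, 5 General sub-groups.
shares = split_district(3000, n_tribes=12, sc_subgroups=3, general_subgroups=5)
print(shares)
```

Assigning General whatever is left after rounding keeps the three categories summing exactly to the district total.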

Survey weights

Because we deliberately oversample small groups, survey weights are essential for population-level statements ("X% of Indians carry variant Y"). Each individual is weighted by the inverse selection probability: w_i = N_group / n_group.

However, most population genetics analyses (PCA, ADMIXTURE, GWAS, phylogenetics) do not use survey weights, because they model genetic structure, and demographic weighting would distort that signal.
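
As a minimal sketch of the weighting w_i = N_group / n_group, the following estimates the population-level share of carriers from a deliberately unbalanced sample. The group sizes and carrier counts are invented for illustration, and the function names are hypothetical.

```python
def weighted_carrier_fraction(groups):
    """Population-level carrier share using weights w = N_group / n_group.

    `groups` is a list of (N_group, n_group, carriers_in_sample) tuples.
    """
    weighted_carriers = 0.0
    weighted_total = 0.0
    for population, sampled, carriers in groups:
        w = population / sampled           # each sampled person represents w people
        weighted_carriers += w * carriers
        weighted_total += w * sampled      # equals the group's population
    return weighted_carriers / weighted_total

def unweighted_carrier_fraction(groups):
    """Naive share that ignores the oversampling of small groups."""
    return sum(c for _, _, c in groups) / sum(n for _, n, _ in groups)

# Invented example: a large group and a small, oversampled group.
groups = [
    (1_000_000, 150, 3),   # 2% carriers in the sample
    (10_000, 150, 60),     # 40% carriers in the sample
]
print(weighted_carrier_fraction(groups))
print(unweighted_carrier_fraction(groups))
```

The unweighted figure is dominated by the oversampled small group, while the weighted figure stays close to the large group's rate, which is the behaviour a population-level statement needs.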

Comparison with existing projects

Project	N	Groups	Strategy	Limitation
1000 Genomes (India)	978	5 linguistic	Convenience	93% ANI ancestry
IndiGen (CSIR)	1,029	Diverse	Purposive	Small N
GenomeAsia 100K	598	7 categories	Tribal focus	Southern bias
GenomeIndia	10,000	83 communities	Stratified purposive	<2% of groups
This design	1,000,000	~4,600 groups	Group-saturation	Census proxy

Key references

Eaaswarkhanth et al. (2021). Integrating linguistics, social structure, and geography to model genetic diversity within India. MBE 38(5):1809. PMC8097304
Basu et al. (2016). Genomic reconstruction of the history of extant populations of India. PNAS 113(6):1594. PMC4760789
Nakatsuka et al. (2017). The promise of discovering population-specific disease-associated genes in South Asia. Nature Genetics 49:1403. PMC5675555
Wendl & Wilson (2009). The theory of discovering rare variants via DNA sequencing. BMC Bioinformatics 10:275. PMC2778663
Karunakarane et al. (2016). Population stratification and underrepresentation of Indian genetic diversity in 1000 Genomes. GBE 8(11):3460. PMC5203783
50,000 years of evolutionary history of India (2025). Cell. DOI
Genomics of rare genetic diseases in India (2019). Human Genomics 13:52. PMC6760067

Legend: sampling rate (per 1,000), from 0 to 10

District allocation directory

District	State	Region	Population	SC%	ST%	Eff. languages	Tribes	Est. groups	Samples ↓	Rate (per 1k)