Distribution of scheduled tribes across districts (Census 2011)
| Tribe Name | Population ↓ | % of District | % of National |
|---|---|---|---|
Browse all tribes
| Tribe name | Districts | Total population ↓ | States | Action |
|---|---|---|---|---|
Mother tongue diversity across districts (Census 2011)
| Language | Population ↓ | % of District |
|---|---|---|
Understanding mother tongue diversity
What is the Primary level?
Primary (Main Languages): Top-level language categories such as HINDI, BENGALI, TAMIL, TELUGU, MARATHI, etc.
Shannon Diversity Index (H)
Formula: H = -Σ(p_i × ln(p_i)), where p_i is the share of the district's population whose mother tongue is language i
- H = 0: Everyone speaks the same language
- H = 0-1: Very low diversity (1-3 languages)
- H = 1-2: Moderate diversity (3-7 languages)
- H > 2: High diversity (7+ languages)
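The index above can be sketched in a few lines. This is a minimal illustration, not the dashboard's own code: the function names and speaker counts are hypothetical, and it assumes that "effective languages" means exp(H) (the number of equally common languages that would give the same H), which the page does not state explicitly.

```python
import math

def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over languages with speakers."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    h = 0.0
    for c in counts:
        if c > 0:  # languages with zero speakers contribute nothing
            p = c / total
            h -= p * math.log(p)
    return h

def effective_number(counts):
    """Assumed definition: exp(H), the equivalent number of equally common languages."""
    return math.exp(shannon_index(counts))

# Hypothetical speaker counts for three districts
single = [1000]                    # everyone speaks one language
even_four = [250, 250, 250, 250]   # four equally common languages
skewed = [900, 50, 30, 20]         # four languages, one dominant

for name, counts in [("single", single), ("even_four", even_four), ("skewed", skewed)]:
    print(name, round(shannon_index(counts), 3), round(effective_number(counts), 2))
```

A district with four equally common languages has exactly four effective languages under this definition, while the skewed district has far fewer even though its total language count is the same.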
Mother tongue diversity (detailed) across districts (Census 2011)
| Language | Population ↓ | % of District |
|---|---|---|
Understanding mother tongue diversity (detailed)
What is the Detailed level?
Detailed (Sub-languages): Dialects and variants within primary language categories.
Browse mother tongue (main languages)
| Language name | Districts | Total speakers ↓ | States | Action |
|---|---|---|---|---|
Browse mother tongue (detailed sub-languages)
| Sub-language name | Main language (L1) | Districts | Total speakers ↓ | States | Action |
|---|---|---|---|---|---|
Language diversity by district
| District | State | Effective languages (L1) ↓ | Total languages (L1) | Effective languages (L2) | Total languages (L2) | Top languages (L1) |
|---|---|---|---|---|---|---|
Scheduled Tribe diversity by district
| District | State | Effective tribes ↓ | Total tribes | ST population | ST % of district | Top tribes |
|---|---|---|---|---|---|---|
Isolated tribes distribution (Census 2011)
Tribes present in ≤3 districts or with a total population <1,000
Isolated tribes directory
| Tribe Name | District | State | % of District ↓ | District Pop. | Total Pop. |
|---|---|---|---|---|---|
Select a state to view ST distribution
| Tribe Name | Population ↓ | % of District | % of National |
|---|---|---|---|
Select a state to view language diversity
L1 - Mother Tongue
| Language | Speakers ↓ | % of District |
|---|---|---|
L2 - Mother Tongue (Detailed)
| Language | Speakers ↓ | % of District |
|---|---|---|
Genomic sampling allocation
The problem
India has ~1.4 billion people distributed across 4,600+ endogamous groups (jatis and tribes) stratified by caste, language, and geography. We want to sequence 1 million genomes. The central question: how do we allocate these samples to maximise genetic diversity captured?
This is not a standard epidemiological survey. In epidemiology, you want estimates that are representative of the population mean. In population genomics, you want to capture maximum genetic variation, which means oversampling rare, isolated, genetically distinct groups relative to their population share.
Why language and caste, not geography?
The single most important finding comes from Eaaswarkhanth et al. (2021): language family + caste/tribe status predict genetic structure (r² = 0.93) far better than geography alone (r² = 0.17). This is the opposite of Europe. The reason is endogamy: India's groups have practiced strict within-group marriage for 70+ generations (~1,575 years; Basu et al. 2016), creating thousands of genetically isolated pockets that co-exist in the same geographic space.
Basu et al. (2016) identified five distinct ancestral components: ANI (Ancestral North Indian), ASI (Ancestral South Indian), AAA (Ancestral Austro-Asiatic), ATB (Ancestral Tibeto-Burman), and AND (Andamanese). Most Indians are ANI+ASI admixtures, but tribal groups can be near-pure representatives of specific components. The 1000 Genomes Project's five Indian populations primarily capture ANI ancestry, leaving ASI, AAA, and ATB largely unrepresented (Karunakarane et al. 2016).
Within-group saturation: why ~150 samples per group suffices
Nakatsuka et al. (2017) showed that 81 out of 263 South Asian groups experienced founder events stronger than both Ashkenazi Jews and Finns. Within an endogamous group, genetic drift eliminates rare variants and pushes remaining ones to higher frequency. General theory (Wendl & Wilson 2009) shows ~150 samples captures 80% of variants at ≥0.1% frequency. In bottlenecked populations, variants that would be at 0.01% get pushed to 0.1–1%, meaning even fewer samples suffice.
The implication: after ~150 samples from an endogamous group, you have captured most of the common and founder-amplified variation. The 10,001st sample from the same jati gives you almost nothing new. But the first sample from an unsampled jati gives you ~30% novel variants (PMC6760067). Breadth across groups matters far more than depth within groups.
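The saturation argument can be explored with a back-of-the-envelope model: the chance that a variant at a given allele frequency is seen at least once among the sampled chromosomes. This is a simplified binomial sketch that assumes independent chromosomes and diploid samples; it is not the framework of Wendl & Wilson (2009), so its outputs should not be expected to reproduce the figures quoted above. The function name and the example frequencies are illustrative.

```python
def detection_probability(freq, n_samples, ploidy=2):
    """P(variant seen at least once) = 1 - (1 - freq)^(ploidy * n_samples).

    Simplifying assumption: each sampled chromosome carries the variant
    independently with probability `freq`.
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError("freq must be between 0 and 1")
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    return 1.0 - (1.0 - freq) ** (ploidy * n_samples)

# How detection changes as drift pushes a variant to higher frequency,
# holding the per-group sample size at 150.
for freq in (0.0001, 0.001, 0.01):
    p = detection_probability(freq, n_samples=150)
    print(f"allele frequency {freq:.4f}: detection probability {p:.3f}")
```

Under this model the detection probability rises steeply with allele frequency at a fixed sample size, which is the mechanism behind the claim that founder-amplified variants need fewer samples.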
The allocation design
The genetic unit is the endogamous group (jati/tribe), not the district. Two jatis in the same district can be as genetically different as Europeans and East Asians. Districts remain the logistical unit (where to set up collection centres), but allocation is driven by the number of distinct endogamous groups reachable from each district.
We estimate groups per district using two census proxies, plus a floor:
- Effective languages (from Shannon entropy of mother tongue data): language family combined with caste/tribe status predicts genetic structure at r² = 0.93, so linguistic diversity serves as a proxy for genetic diversity. Each effective language plausibly contains ≥2 endogamous strata (upper caste + lower caste at minimum).
- Number of distinct tribes (from Census 2011 C-16 data): tribal groups are the most genetically distinct populations.
- Floor of 3: every district has at minimum SC + ST + General categories.
est_groups = max(3, round(effective_languages × 2) + n_tribes)
n_samples = est_groups × 150, rescaled to sum to 1,000,000
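The two formulas above can be sketched as follows. The function names, the input layout, and the district values are hypothetical. The source does not say how fractional samples are rounded after rescaling, so this sketch assumes largest-remainder rounding to make the total come out exactly.

```python
def estimate_groups(effective_languages, n_tribes):
    """est_groups = max(3, round(effective_languages * 2) + n_tribes).

    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    return max(3, round(effective_languages * 2) + n_tribes)

def allocate_samples(districts, total=1_000_000, per_group=150):
    """Allocate `total` samples across districts in proportion to est_groups.

    districts: dict mapping district name -> (effective_languages, n_tribes).
    Rounding uses the largest-remainder method (an assumption of this sketch).
    """
    raw = {name: estimate_groups(el, nt) * per_group
           for name, (el, nt) in districts.items()}
    scale = total / sum(raw.values())
    scaled = {name: value * scale for name, value in raw.items()}
    alloc = {name: int(value) for name, value in scaled.items()}  # floor
    leftover = total - sum(alloc.values())
    # Hand the leftover samples to the districts with the largest fractions.
    by_fraction = sorted(scaled, key=lambda n: scaled[n] - alloc[n], reverse=True)
    for name in by_fraction[:leftover]:
        alloc[name] += 1
    return alloc

# Hypothetical districts: (effective languages, number of distinct tribes)
districts = {"A": (1.2, 0), "B": (3.6, 5), "C": (2.0, 12)}
allocation = allocate_samples(districts)
print(allocation, sum(allocation.values()))
```

Because the result is rescaled to a fixed total, the per-group target of 150 cancels out of the final numbers; each district's share depends only on its estimated group count relative to the national sum.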
Within-district allocation
Within each district, samples are split across SC / ST / General categories with equal per-group targets rather than proportional-to-population. ST groups get samples proportional to the number of distinct tribes present; SC and General groups share the remainder proportional to estimated sub-groups.
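One way to read the split described above is shown below. It is a sketch under stated assumptions: the source gives no formula, so the ST share is taken as the fraction of the district's groups that are tribes, and the SC and General sub-group counts (`sc_groups`, `general_groups`) are hypothetical inputs that would have to be estimated separately.

```python
def split_district(n_samples, n_tribes, sc_groups, general_groups):
    """Split a district's samples across ST / SC / General by group counts.

    Assumption: every group gets the same per-group target, so each category's
    share is proportional to the number of groups it contains.
    """
    total_groups = n_tribes + sc_groups + general_groups
    if total_groups <= 0:
        raise ValueError("district must contain at least one group")
    st = round(n_samples * n_tribes / total_groups)
    remainder = n_samples - st
    non_tribal = sc_groups + general_groups
    sc = round(remainder * sc_groups / non_tribal) if non_tribal else 0
    general = remainder - sc  # whatever is left, so the parts always sum up
    return {"ST": st, "SC": sc, "General": general}

# Hypothetical district: 3,000 samples, 5 tribes, 3 SC and 7 General sub-groups
split = split_district(3000, n_tribes=5, sc_groups=3, general_groups=7)
print(split)
```

Compared with a proportional-to-population split, this gives a small tribe the same target as a large caste group, which is the intended oversampling.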
Survey weights
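Because we deliberately oversample small groups, survey weights are essential for population-level statements ("X% of Indians carry variant Y"). Each individual is weighted by the inverse of their selection probability: w_i = N_group / n_group, where N_group is the population of the individual's group and n_group is the number of people sampled from it.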
However, most population genetics analyses (PCA, ADMIXTURE, GWAS, phylogenetics) do not use survey weights: they model genetic structure, where demographic weighting distorts the signal.
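For the population-level statements mentioned above, the inverse-probability weights can be applied as in the sketch below. The function name, the record layout, and the group sizes and carrier counts are all hypothetical; they are chosen only to show how far the weighted and unweighted estimates can diverge under deliberate oversampling.

```python
def weighted_carrier_frequency(groups):
    """Population-level carrier frequency using w = N_group / n_group.

    groups: list of dicts with keys
      "N"        - group population
      "n"        - number of people sampled from the group
      "carriers" - number of sampled people carrying the variant
    """
    weighted_carriers = 0.0
    weighted_total = 0.0
    for g in groups:
        w = g["N"] / g["n"]               # each sampled person represents w people
        weighted_carriers += w * g["carriers"]
        weighted_total += w * g["n"]      # equals the group population N
    return weighted_carriers / weighted_total

# Hypothetical data: a large group and a small, oversampled founder group
groups = [
    {"N": 1_000_000, "n": 150, "carriers": 3},
    {"N": 5_000, "n": 150, "carriers": 60},
]
unweighted = sum(g["carriers"] for g in groups) / sum(g["n"] for g in groups)
weighted = weighted_carrier_frequency(groups)
print(f"unweighted: {unweighted:.4f}, weighted: {weighted:.4f}")
```

The unweighted estimate treats both groups as equally large and is dominated by the small founder group; the weighted estimate scales each group back to its actual population share.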
Comparison with existing projects
| Project | N | Groups | Strategy | Limitation |
|---|---|---|---|---|
| 1000 Genomes (India) | 978 | 5 linguistic | Convenience | 93% ANI ancestry |
| IndiGen (CSIR) | 1,029 | Diverse | Purposive | Small N |
| GenomeAsia 100K | 598 | 7 categories | Tribal focus | Southern bias |
| GenomeIndia | 10,000 | 83 communities | Stratified purposive | <2% of groups |
| This design | 1,000,000 | ~4,600 groups | Group-saturation | Census proxy |
Key references
- Eaaswarkhanth et al. (2021). Integrating linguistics, social structure, and geography to model genetic diversity within India. MBE 38(5):1809. PMC8097304
- Basu et al. (2016). Genomic reconstruction of the history of extant populations of India. PNAS 113(6):1594. PMC4760789
- Nakatsuka et al. (2017). The promise of discovering population-specific disease-associated genes in South Asia. Nature Genetics 49:1403. PMC5675555
- Wendl & Wilson (2009). The theory of discovering rare variants via DNA sequencing. BMC Bioinformatics 10:275. PMC2778663
- Karunakarane et al. (2016). Population stratification and underrepresentation of Indian genetic diversity in 1000 Genomes. GBE 8(11):3460. PMC5203783
- 50,000 years of evolutionary history of India (2025). Cell. DOI
- Genomics of rare genetic diseases in India (2019). Human Genomics 13:52. PMC6760067
District allocation directory
| District | State | Region | Population | SC% | ST% | Eff. languages | Tribes | Est. groups | Samples ↓ | Rate (per 1k) |
|---|---|---|---|---|---|---|---|---|---|---|