(Version 2.0) The data files below describe the SNP data used for the
paper "Using population mixtures to optimize the utility of genomic
databases: linkage disequilibrium and association study design in
India" by TJ Pemberton, M Jakobsson, DF Conrad, G Coop, JD Wall, JK
Pritchard, PI Patel, NA Rosenberg (Annals of Human Genetics x:y-z
[2008]). These data are a combination of data from 2 Indian
populations and the data used for the paper "A worldwide survey of
haplotype variation and linkage disequilibrium in the human genome" by
DF Conrad, M Jakobsson, G Coop, X Wen, JD Wall, NA Rosenberg, JK
Pritchard (Nature Genetics 38:1251-1260 [2006]).

*Version 2.0 of the package of files - created by Mattias, Feb 14, 2008

Versions numbered 1.x are associated with the paper of Conrad et
al. 2006 and not with the paper of Pemberton et al. 2008.

---------------------------------------------------------------------

The following data sets are available:

1. unphased_HGDP+India_regions1to36
(HGDP data+Indian data - 957 individuals, 2810 SNPs)

2. phased_HGDP+India_regions1to36
(HGDP data+Indian data - 957 individuals, 2810 SNPs)

3. phased_HapMap_regions1to22and27to36
(Phase 2 HapMap genotypes for SNPs in our data - 210 individuals, 
1853 SNPs)

4. phased_HGDP+India+HapMap_regions1to22and27to36 
(HGDP+Indian data+HapMap genotypes for SNPs in our data - 1167
individuals, 1853 SNPs)

5. unphased_HGDP+India_relativesincl_regions1to36
(HGDP+Indian data - 1047 individuals, 2810 SNPs)

File 1 is the raw HGDP and Indian unphased data, after elimination of
SNPs that failed quality checks and individuals who were related.

File 2 is the phased HGDP and Indian data for genomic regions 1 to 36,
with all missing genotypes imputed.

File 3 is the phased HapMap data with all missing genotypes imputed,
for autosomal regions 1 to 22 and 27 to 36.  This file was created
from the HapMap collection by leaving out the offspring in CEU and YRI
trios.

File 4 is the combined HGDP, Indian, and HapMap data from Files 2 and
3, for regions 1 to 22 and 27 to 36.

File 5 is the raw unphased HGDP and Indian data, after elimination of
SNPs that failed quality checks, before removing individuals who were
related.

For one SNP (rs12123995), Files 1 and 2 appear to differ in strand
polarity from File 3 - that is, for this SNP, our data arrived with a
strand polarity different from that in the HapMap.  To create File 4,
this SNP was repolarized in the HapMap data from File 3 to match the
HGDP+Indian data, and the result was combined with the subset of SNPs
in File 2 that are present in the HapMap.
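
As an illustration only, the sketch below shows one way a strand
difference of this kind can be reconciled. It assumes that
"repolarized" means replacing each allele with its complement (A<->T,
C<->G); the procedure actually used to build File 4 is not described
here, and the function name flip_strand is hypothetical.

```python
# Complement map for re-expressing alleles on the opposite strand.
# Assumption: repolarizing a SNP means complementing each allele.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "?": "?"}


def flip_strand(alleles):
    """Return the alleles re-expressed on the opposite strand.

    The missing-data symbol '?' is passed through unchanged.
    """
    return [COMPLEMENT[a] for a in alleles]


print(flip_strand(["A", "G", "?", "C"]))  # ['T', 'C', '?', 'G']
```

Applying the function twice returns the original alleles, which is a
quick sanity check on any strand-flipping step.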

The data are in "structure" format with 2 rows per individual.  

Rows:
1. rs number
2. region number (1..36)
3. chromosome number
4. snp position on chromosome
5. core region indicator (1 = core region, 0 = non-core region)
6 to 2X+5. individual data for X individuals (2 rows per individual)
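
The row layout above can be turned into simple index arithmetic. The
sketch below assumes that the two rows for each individual are stored
consecutively and in order after the 5 header rows; the helper names
are hypothetical.

```python
HEADER_ROWS = 5  # rs number, region, chromosome, position, core indicator


def individual_rows(i):
    """Return the 1-based row numbers of the two rows for individual i.

    Individuals are numbered i = 1..X. Assumes each individual's two
    rows are adjacent, directly after the header rows.
    """
    first = HEADER_ROWS + 2 * (i - 1) + 1
    return first, first + 1


def total_rows(n_individuals):
    """Return the expected number of rows in a file with X individuals."""
    return HEADER_ROWS + 2 * n_individuals


print(individual_rows(1))  # (6, 7)
print(total_rows(957))     # 1919
```

Comparing total_rows against the actual line count of a file is a
cheap check that the file was read completely.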

Note that 3 of the core regions (regions 30, 31 and 32) contain large
gaps that we identified only after the design phase of the Conrad et
al. 2006 study was complete. For one analysis using core SNPs, we
split each of these regions into two regions. In this case, we
included 5 non-core SNPs (rs212858 in region 30, rs6467107 in region
32, rs6962580 in region 32, rs6467108 in region 32, and rs6467109 in
region 32) because each of these SNPs was located less than 200bp from
a core SNP or less than 200bp from a non-core SNP that was connected
to a core SNP by less than 200bp.

In the phased files (Files 2-4) each of the two rows for an individual
represents one of the two haplotypes.  Phasing was performed within
genomic regions, so there is no correspondence of haplotypes across
region boundaries.  In the unphased files (Files 1 and 5), the
placement of genotypes on the first versus second line for an
individual is arbitrary.
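
The following sketch illustrates both points under stated assumptions:
the rows have already been read into lists of alleles, and the region
row into a list of region numbers of the same length. The function
names and the example values are hypothetical and are not taken from
the data files.

```python
def unordered_genotypes(row_a, row_b):
    """Pair the alleles at each SNP, ignoring which row they came from.

    Suitable for the unphased files, where row order is arbitrary.
    """
    return [tuple(sorted(pair)) for pair in zip(row_a, row_b)]


def split_by_region(regions, haplotype):
    """Cut one haplotype row into per-region segments.

    Suitable for the phased files, where haplotypes are only
    meaningful within a region.
    """
    segments = {}
    for region, allele in zip(regions, haplotype):
        segments.setdefault(region, []).append(allele)
    return segments


# Made-up example rows, for illustration only.
row1 = ["A", "C", "?", "T"]
row2 = ["G", "C", "?", "A"]
print(unordered_genotypes(row1, row2))
# [('A', 'G'), ('C', 'C'), ('?', '?'), ('A', 'T')]

print(split_by_region([1, 1, 2, 2], row1))
# {1: ['A', 'C'], 2: ['?', 'T']}
```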

Columns for individual data (HGDP individuals):
1. HGDP ID number
2. numeric code for population
3. name of population
4. country of origin
5. geographic region of origin
6. ID number assigned during genotyping
7. sex
8... genotypes (A, C, G, or T; ? denotes missing data or a hemizygous
   male on the X-chromosome)
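
A minimal sketch of reading one individual row with this column
layout follows. It assumes that fields are separated by whitespace and
that no field contains embedded whitespace; neither is stated above,
so check against the files before relying on it. The field names and
the example line are hypothetical placeholders.

```python
N_META = 7  # number of descriptive columns before the genotypes
FIELDS = (
    "hgdp_id",
    "pop_code",
    "pop_name",
    "country",
    "geo_region",
    "genotyping_id",
    "sex",
)


def parse_individual_row(line):
    """Split one individual row into (metadata dict, list of alleles).

    Assumption: whitespace-delimited fields with no embedded spaces.
    """
    tokens = line.split()
    if len(tokens) < N_META:
        raise ValueError("row has fewer than 7 descriptive columns")
    meta = dict(zip(FIELDS, tokens[:N_META]))
    genotypes = tokens[N_META:]
    return meta, genotypes


# Placeholder line, not real data.
example = "ID POPCODE POPNAME COUNTRY GEOREGION GENOID SEX A C ? T"
meta, genotypes = parse_individual_row(example)
print(meta["pop_name"])  # POPNAME
print(genotypes)         # ['A', 'C', '?', 'T']
```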

Columns for individual data (HapMap individuals):
1. HapMap ID number (string)
2. numeric code for individual - this code is not unique and repeats
   across HapMap populations; the number may also coincide with a
   population code used for an HGDP population
3. name of population (YRI, CEU, or JPT+CHB)
4. name of population (YRI, CEU, or JPT+CHB)
5. name of population (YRI, CEU, or JPT+CHB)
6. meaningless - a placeholder to make the number of columns match the HGDP data
7. meaningless - a placeholder to make the number of columns match the HGDP data
8... genotypes (A, C, G, or T; ? denotes missing data or a hemizygous
   male on the X-chromosome)
