These files describe the exact data used in the PNAS article "Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa," and the PLoS Genetics article "Clines, clusters, and the effect of study design on the inference of human population structure."

As described in these articles, there are some minor differences between the sets of individuals studied in the two articles, the set in the HGDP-CEPH Human Genome Diversity Cell Line Panel, and the set genotyped by the Mammalian Genotyping Service at the Marshfield Medical Research Foundation. Thus we provide the data in the form in which we used it for the analyses.

The loci in the new data include as a subset the 377 loci studied by Rosenberg et al. (2002). For these 377 loci, the names in the new data files match those in the previous files associated with the Rosenberg et al. (2002) paper. The new loci are presented with a name that contains an underscore followed by the chromosome number.

Note that the indel MID2296_01 is missing all of its genotypes in Russians, and the indel MID0731_05 is missing all of its genotypes in Cambodians.

With questions about these files, please contact me.

Noah Rosenberg
November 3, 2005

-------------------------------------------------------------------

1. combined-1048.stru

This file includes the exact data used by Rosenberg et al. (2005) --- both microsatellites and indels. The format is that used by the structure program. The first line gives the list of loci. After the first line, each individual is listed on two consecutive lines. The first five columns include the following information:

(1) Individual code number assigned by CEPH.
(2) Population code number assigned by us.
(3) Population name.
(4) Geographic information about the population.
(5) Pre-defined region, as was used in Rosenberg et al. (2002).

The next columns contain genotypes (measured in base pairs).
The left-to-right order of the genotypes corresponds to the left-to-right order of the locus names on the first line of the file. The placement of genotypes on the first versus second line for an individual is arbitrary. Missing data is denoted by "-9".

-------------------------------------------------------------------

2. combinedmicrosats-1048.stru

This file includes the exact data used by Rosenberg et al. (2005) --- microsatellites only. The format is that used by the structure program (see #1 above).

-------------------------------------------------------------------

3. indels-1048.stru

This file includes the exact data used by Rosenberg et al. (2005) --- indels only. Genotypes are 1 for the "short" allele and 2 for the "long" allele. The format is that used by the structure program (see #1 above).

-------------------------------------------------------------------

4. combinedmicrosats-1048.nex

This file includes the exact data used by Rosenberg et al. (2005) --- microsatellites only. The format is that used by the GDA (Genetic Data Analysis) program. This format, the NEXUS format, is further described on the GDA website. Briefly, each locus is listed on its own line. Each individual is then listed on a single line. Individuals are coded using their population names and their code numbers as assigned by CEPH. From left to right, diploid genotypes (measured in base pairs) follow the top-to-bottom order of the loci. Missing data is denoted by "?". At the bottom of the file are three "hierarchies," which correspond to different groupings of populations into regions for analysis of molecular variance.

We have sometimes observed a compatibility problem in NEXUS files between Macs and PCs. If GDA produces an error, try adding or removing newline characters at the ends of NEXUS files, and then reload the file.

-------------------------------------------------------------------

5. indels-1048.nex

This file includes the exact data used by Rosenberg et al.
(2005) --- indels only. The format is that used by the GDA program. This format, the NEXUS format, is further described above (see #4) and on the GDA website.

-------------------------------------------------------------------

6. combinedmicrosats-1048.freqs

This file contains the count estimates of allele frequencies based on the individual data in Rosenberg et al. (2005) --- microsatellites only. Each line gives the frequencies of an allele in all populations. Locus names are the same as in combinedmicrosats-1048.stru. The columns of this file include the following information:

(1) Locus name.
(2) Allele (measured in base pairs).
(2n+1) Population.
(2n+2) Estimated frequency in the population in column 2n+1.

n ranges from 1 to 53 (there are 53 populations).

-------------------------------------------------------------------

7. indels-1048.freqs

This file contains the count estimates of allele frequencies based on the individual data in Rosenberg et al. (2005) --- indels only. The format is the same as that for the corresponding file on microsatellites (see #6 above). For the loci MID2296_01 in Russians and MID0731_05 in Cambodians, no frequencies are given (the notation "---" appears instead of allele frequencies).

-------------------------------------------------------------------

8. rosenbergEtAl2005.coordinates.txt

This file contains the geographic coordinates for populations in the Rosenberg et al. (2005) paper. Each line in this file contains a population name, the latitude used for the population in degrees north, the longitude in degrees east, and the spherical coordinates for the populations.

Note that in computation of A_n in the paper, the coordinates used for the southern African Bantu individuals were based on the lines for the individual populations (Herero, Ovambo, Pedi, Sotho, Tswana, Zulu). In the calculation of Fst in Fig. 6, the composite coordinates for BantuSouthAfrica were used instead.
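
The column layout of the .freqs files (see #6 and #7 above) can be read with a few lines of code. The following Python sketch is only an illustration and is not part of the distributed files. It assumes that columns are separated by whitespace and that population names contain no spaces; neither assumption is stated in this README, so check them against the actual files. The function name parse_freqs_line and the sample values in the usage note are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Marker that appears in place of a frequency when none is given (see #7).
MISSING = "---"


def parse_freqs_line(line: str) -> Tuple[str, str, Dict[str, Optional[float]]]:
    """Split one .freqs line into (locus, allele, {population: frequency}).

    Assumption (not stated in the README): fields are whitespace-separated
    and population names contain no whitespace.
    """
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("expected at least a locus name and an allele")
    locus, allele = fields[0], fields[1]
    rest = fields[2:]
    if len(rest) % 2 != 0:
        raise ValueError("expected population/frequency pairs after the allele")
    freqs: Dict[str, Optional[float]] = {}
    # Columns 2n+1 and 2n+2 hold a population and its estimated frequency.
    for population, value in zip(rest[0::2], rest[1::2]):
        freqs[population] = None if value == MISSING else float(value)
    return locus, allele, freqs
```

For example, a made-up line such as "LOCUS_01 1 PopA 0.25 PopB ---" would yield the locus "LOCUS_01", the allele "1", a frequency of 0.25 for PopA, and None for PopB. In the real files, each line should carry 53 population/frequency pairs.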
-------------------------------------------------------------------

9. rosenbergEtAl2005.codes.txt

This file contains code numbers that have been assigned to the populations in files associated with the Rosenberg et al. (2005) paper. The columns include the following information:

(1) Population code number.
(2) Population name.
(3) Geographic information about the population.
(4) Pre-defined region, as was used in the article.

-------------------------------------------------------------------

10. combinedmicrosats-1027.stru

This file includes the exact data used by Ramachandran et al. (2005). The format is that used by the structure program (see #1 above).

-------------------------------------------------------------------

11. combinedmicrosats-1027.nex

This file includes the exact data used by Ramachandran et al. (2005). The format is that used by the GDA program. This format, the NEXUS format, is further described above (see #4) and on the GDA website.

-------------------------------------------------------------------

12. ramachandranEtAl2005.coordinates.txt

This file contains the geographic coordinates for populations in the Ramachandran et al. (2005) paper. Each line in this file contains a population name, the geographic region for the population, the latitude used for the population in degrees north, and the longitude in degrees east.

-------------------------------------------------------------------

13. ramachandranEtAl2005.codes.txt

This file contains code numbers that have been assigned to the populations in files associated with the Ramachandran et al. (2005) paper. See #9 for the format of this file.

-------------------------------------------------------------------
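
For readers who want to load the .stru files (#1, #2, #3, and #10) outside the structure program, the layout described in #1 can be sketched in Python as follows. This is an illustration under stated assumptions, not code used for the articles: it assumes whitespace-separated columns, metadata fields that contain no spaces, and that both lines for an individual begin with the same individual code number. The function name parse_structure and the dictionary keys are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

MISSING = "-9"  # missing-data code described in #1
N_META = 5      # number of leading metadata columns described in #1

Genotype = Tuple[Optional[int], Optional[int]]


def _allele(token: str) -> Optional[int]:
    """Convert one genotype token to an int, or None if it is missing."""
    return None if token == MISSING else int(token)


def parse_structure(lines: Iterable[str]) -> Tuple[List[str], List[dict]]:
    """Return (locus names, one record per individual).

    Assumptions (not stated in the README): columns are whitespace-separated,
    metadata fields contain no whitespace, and the two lines for an
    individual start with the same individual code number.
    """
    rows = [ln.split() for ln in lines if ln.strip()]
    if not rows:
        raise ValueError("empty input")
    loci, body = rows[0], rows[1:]
    if len(body) % 2 != 0:
        raise ValueError("expected two lines per individual")
    individuals = []
    for first, second in zip(body[0::2], body[1::2]):
        if first[0] != second[0]:
            raise ValueError("paired lines have different individual codes")
        expected = N_META + len(loci)
        if len(first) != expected or len(second) != expected:
            raise ValueError("genotype count does not match the locus list")
        # Genotypes follow the left-to-right order of the locus names.
        genotypes: Dict[str, Genotype] = {
            locus: (_allele(a), _allele(b))
            for locus, a, b in zip(loci, first[N_META:], second[N_META:])
        }
        individuals.append({
            "individual": first[0],
            "population_code": first[1],
            "population": first[2],
            "geography": first[3],
            "region": first[4],
            "genotypes": genotypes,
        })
    return loci, individuals
```

Because the placement of genotypes on the first versus second line is arbitrary, the two alleles in each pair should be treated as unordered.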