
Analysis of genome-wide SNP data

The advent of data sets containing hundreds of thousands of SNPs presents two practical challenges for structure, described below: computational speed and memory use.

Computational speed.

For very large data sets, the runtime of structure under the default settings may become impractically slow. We hope eventually to release a new version that runs much faster; in the meantime, consider the following strategies. (1) You will probably get similar answers from a random subset of 10K SNPs as from the full 500K, so it makes sense to do all of your exploratory analysis on reduced data sets (one way to draw such a subset is sketched below). (2) The default settings for BURNIN and NUMREPS are very conservative (i.e., longer than strictly necessary) for typical data sets, and you may well get accurate results from much shorter runs than the defaults. In particular, accurate estimates of the ancestry proportions should be attainable even with small values of NUMREPS.
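The following is a minimal sketch, in Python, of one way to carry out the subsampling in (1). The file names, the META_COLS and N_KEEP settings, and the assumed layout (a whitespace-delimited structure input file with no header row and two leading non-genotype columns) are illustrative assumptions of ours, not part of the structure distribution; adjust them to match your own input format.

    import random

    META_COLS = 2                # leading non-genotype columns (e.g. label, popdata); adjust as needed
    N_KEEP = 10000               # size of the reduced SNP set
    INFILE = "all_snps.str"      # hypothetical file names
    OUTFILE = "subset_10k.str"

    with open(INFILE) as f:
        rows = [line.split() for line in f if line.strip()]

    # Draw one random set of loci and apply it to every row, so that both
    # rows of a two-line-per-individual file keep matching columns.
    n_loci = len(rows[0]) - META_COLS
    keep = sorted(random.sample(range(n_loci), N_KEEP))

    with open(OUTFILE, "w") as out:
        for row in rows:
            meta, geno = row[:META_COLS], row[META_COLS:]
            out.write(" ".join(meta + [geno[i] for i in keep]) + "\n")

Remember to set NUMLOCI in mainparams to match the reduced file. For (2), run length is also set in mainparams; values such as the following are purely illustrative, and you should verify that repeated runs give consistent estimates before relying on them:

    #define BURNIN 5000     // default values are considerably larger
    #define NUMREPS 10000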

Program exceeds computer memory.

The standard structure release for 32-bit machines cannot handle extremely large data sets. (The maximum data set size for the standard release is probably somewhere around 100 million total genotypes.) Such data sets can be analyzed on 64-bit machines. To do so, you will need to download the source code from our website and compile it on your own machine. As currently configured, the Java front end also cannot load extremely large data sets, so you will need to use the command-line version of structure.
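A hedged sketch of the build-and-run steps is given below. The archive, directory, and file names are placeholders, and the assumption that the source unpacks with a Makefile producing a binary called structure reflects recent releases; consult the README that ships with the code. The -m, -e, -K, -i, and -o flags name the mainparams file, the extraparams file, the number of populations, the input file, and the output file, respectively.

    tar -xzf structure_kernel_source.tar.gz    # placeholder archive name
    cd structure_kernel_src
    make                                       # builds the command-line binary

    ./structure -m mainparams -e extraparams \
                -K 3 -i big_dataset.str -o results_K3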

If the build succeeded, the Unix file command should describe the resulting binary as a 64-bit executable, reporting something like the following (details such as linking and kernel version will vary by system):

    ELF 64-bit LSB executable, AMD x86-64, for GNU/Linux 2.4.0, statically linked