BFCounter: Memory-efficient k-mer counting software
Pritchard Lab, Stanford University
BFCounter is a program for counting k-mers in DNA sequence
data. Counting k-mers (substrings of length k) is an essential
component of many methods in bioinformatics, including for genome
and transcriptome assembly, for metagenomic sequencing, and for
error correction of sequence reads. Although simple in principle,
counting k-mers in large modern sequence data sets can easily
overwhelm the memory capacity of standard computers. In current data
sets, a large fraction (often more than 50%) of the storage
capacity may be spent on storing k-mers that contain sequencing
errors and are typically observed only once in the data.
These singleton k-mers are uninformative for many algorithms
without some kind of error correction.
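To make the problem concrete, here is a minimal sketch of naive k-mer counting with an in-memory dictionary. It is an illustration only, not BFCounter code: the function name `count_kmers` and the toy reads are hypothetical, and reverse complements are ignored for simplicity. Every distinct k-mer, including each singleton, takes up a dictionary entry, which is what drives the memory cost on real data.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy data: the second read differs from the first in its last base,
# mimicking a single sequencing error.
reads = ["ACGTACGT", "ACGTACGA"]
counts = count_kmers(reads, 4)
singletons = [kmer for kmer, c in counts.items() if c == 1]
```

In this toy input the only singleton is the k-mer that covers the differing base, yet it still occupies its own entry in the table.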
BFCounter identifies all the k-mers that occur more than once in a
DNA sequence data set. Our method does this using a Bloom filter, a
probabilistic data structure that stores all the observed k-mers
implicitly in memory with greatly reduced memory requirements.
Publication: Melsted, P. and Pritchard, J.K.:
Efficient counting of k-mers in DNA sequences using a bloom
filter. BMC Bioinformatics 2011, 12:333.
Download the current version from the GitHub repository. This version
adds several features over earlier versions, including multithreading.
Changes
- Version 0.2
- Proper command line options and help messages
- Bloom filter size can be specified as a parameter
- Output is saved to a binary file and can be converted
into a tab-delimited text file containing all k-mers and counts
- Support for Quake-based output counting q-mers, and tab-based
output for use with the Quake error correction software
- Version 0.1 - This was the initial version which only counted
k-mers and was used to run the experiments in the paper.
Previous versions:
BFCounter 0.2
BFCounter 0.1
For questions or comments, write to pmelsted at gmail.com