BFCounter: Memory-efficient k-mer counting software

Pritchard Lab, Stanford University




BFCounter is a program for counting k-mers in DNA sequence data. Counting k-mers (substrings of length k) is an essential component of many methods in bioinformatics, including genome and transcriptome assembly, metagenomic sequencing, and error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction (often more than 50%) of the storage capacity may be spent on k-mers that contain sequencing errors and are typically observed only once in the data. Without some kind of error correction, these singleton k-mers are uninformative for many algorithms.
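To make the problem concrete, here is a minimal Python sketch of naive k-mer counting with an ordinary hash table. It is only an illustration, not BFCounter's code: the function names and the toy reads are hypothetical, and reverse complements are ignored for simplicity. Note that the singleton k-mer gets a table entry just like every other k-mer, which is where the memory goes on real data.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def count_kmers(reads, k):
    """Count all k-mers exactly; every distinct k-mer costs one table entry."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts

# Toy reads (hypothetical): the second differs from the first in its last base,
# as a sequencing error might cause.
reads = ["ACGTACGT", "ACGTACGA"]
counts = count_kmers(reads, 4)

# k-mers seen exactly once; here only the one covering the differing base.
singletons = [kmer for kmer, n in counts.items() if n == 1]
print(counts)
print(singletons)
```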

BFCounter identifies all the k-mers that occur more than once in a DNA sequence data set. It does this using a Bloom filter, a probabilistic data structure that stores the observed k-mers implicitly, using far less memory than an explicit table of every k-mer would require.
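The idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not BFCounter's actual implementation: it assumes a two-pass scheme in which a k-mer is promoted to the hash table only once the Bloom filter reports it as already seen, followed by an exact recount that discards false positives. The `BloomFilter` class, the function names, the hash construction, and the default parameters are all hypothetical.

```python
import hashlib

class BloomFilter:
    """A small Bloom filter over strings (illustrative, not tuned)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from one digest (double hashing).
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def count_repeated_kmers(reads, k, num_bits=1 << 16, num_hashes=4):
    """Return exact counts for k-mers that occur more than once."""
    bloom = BloomFilter(num_bits, num_hashes)
    counts = {}
    # Pass 1: a k-mer enters the table only if the filter has (probably)
    # seen it before, so most singletons never get a table entry.
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in bloom:
                counts.setdefault(kmer, 0)
            else:
                bloom.add(kmer)
    # Pass 2: count the candidates exactly.
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in counts:
                counts[kmer] += 1
    # Drop Bloom filter false positives, whose true count is 1.
    return {kmer: n for kmer, n in counts.items() if n > 1}

repeated = count_repeated_kmers(["ACGTACGT", "ACGTACGA"], 4)
print(repeated)
```

Because a Bloom filter has no false negatives, every k-mer that truly occurs twice or more reaches the table; false positives only cost a few extra table entries until the second pass removes them, so the final counts are exact.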

Publication: Melsted, P. and Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics 2011, 12:333.

Download the current version from the GitHub repository. It adds several features not found in earlier versions, including multithreading.

Changes

  • Version 0.2
    • Proper command line options and help messages
    • Bloom filter size can be specified as a parameter
    • Output saved to a binary file and can be converted into tab-delimited text file containing all k-mers and counts
    • Support for Quake-based output: counting q-mers and writing tab-delimited output for use with the Quake error correction software
  • Version 0.1 - The initial version, which only counted k-mers and was used to run the experiments in the paper.

Previous versions: BFCounter 0.2, BFCounter 0.1

For questions or comments, write to pmelsted at gmail.com.