[Biopython] Correcting short read errors based on k-mer coverage

Mon Sep 28 11:10:41 UTC 2009

On Fri, Sep 25, 2009 at 5:34 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Fri, Sep 25, 2009 at 5:16 PM, Dan Bolser <dan.bolser at gmail.com> wrote:
>>
>> 2009/9/25 Peter <biopython at maubp.freeserve.co.uk>:
>>>
>>> I just tried with a short read file from the NCBI SRA with ~7 million reads
>>> of 36bp and k=21. Each 36bp read gives 16 k-mers, thus I had in total
>>> ~100 million kmers in total, and found about ~18 million different kmers.
>>> About half occurred only once.
>>>
>>> My naive code to count the kmers used a Python dictionary (k-mer
>>> strings as the keys, integer counts as values). It took about 5 minutes
>>> to run and about 1.5 GB of RAM.

An alternative approach reduced the memory needed for this example
from 1.25GB resident to about 0.8GB resident, while still taking about
5 mins. Instead of storing the kmers as strings, I encoded them as large
integers (basically using 2 bits per letter instead of 8 bits). This means
for kmers up to and including 32-mers, you need only a 64bit unsigned
long. You can do this in Python, but my initial code was a bit slow - so
I redid it as a Python C extension. The only problem here is what to do
with ambiguous sequences - for example any N characters?

This still used a Python dictionary to hold the (integer) encoded kmer
sequences as keys, and their (integer) counts as values. As noted
before, there are disk based options here like an SQLite database.

Peter