[Biopython] Correcting short read errors based on k-mer coverage

Peter biopython at maubp.freeserve.co.uk
Fri Sep 25 15:42:39 UTC 2009


Dan Bolser <dan.bolser at gmail.com> wrote:
> Step 1 uses quality to select high quality regions of reads. These
> reads are broken down into k-mers (say of length 21), and then you
> construct a k-mer frequency table. i.e. k-mer TATATATATATATATATATAT
> occurs 5000 times in my read set. Here you need to consider memory
> usage.

I just tried with a short read file from the NCBI SRA with ~7 million reads
of 36bp and k=21. Each 36bp read gives 16 k-mers (36 - 21 + 1), so I had
~100 million k-mers in total, and found ~18 million different k-mers.
About half of those occurred only once.

My naive code to count the k-mers used a Python dictionary (k-mer
strings as the keys, integer counts as values). It took about 5 minutes
to run and used about 1.5 GB of RAM.
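
For illustration, a minimal sketch of this kind of dictionary-based
counting is given below. It is not the code that produced the numbers
above. The function name count_kmers and the toy reads are hypothetical,
and it assumes the reads are already available as plain strings with no
quality trimming applied.

```python
from collections import defaultdict


def count_kmers(reads, k=21):
    """Count every k-mer (as a string key) across an iterable of read strings.

    Hypothetical helper for illustration; reads shorter than k contribute nothing.
    """
    counts = defaultdict(int)
    for seq in reads:
        # A read of length L yields L - k + 1 overlapping k-mers.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Toy example with made-up reads and a small k so the result is easy to check.
toy_reads = ["ACGTACGTAC", "CGTACGTACG"]
toy_counts = count_kmers(toy_reads, k=8)

# How many distinct k-mers were seen exactly once.
singletons = sum(1 for n in toy_counts.values() if n == 1)
```

With real data the reads could be taken from a FASTQ file, for example
with Bio.SeqIO.parse(handle, "fastq") and str(record.seq). Memory use
grows with the number of distinct k-mers, since each one is stored as a
separate string key.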

What sized files are you hoping to run this on? Without knowing that,
it is hard to say if this simple dictionary approach will scale well.

Dan Bolser <dan.bolser at gmail.com> wrote:
> In step 2 you take the full reads (ignoring qualities) and look at the
> k-mer frequency (average?) at each base. Some bases will have a very
> low k-mer frequency, indicating sequencing errors.

Are you suggesting following the method of Chaisson et al 2009,
described in section "Detecting and error correcting accurate read
prefixes" of that paper - or something a little different? That section
itself cites several related approaches to read correction.
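
One possible reading of step 2 can be sketched as follows, under the
assumption that the "k-mer frequency at each base" means the average
count of all k-mers overlapping that base. This is only an
interpretation of the quoted description, not the method from the cited
paper. The function name per_base_kmer_coverage and the toy count table
are hypothetical.

```python
def per_base_kmer_coverage(seq, counts, k=21):
    """For each base in seq, average the counts of all k-mers overlapping it.

    counts maps k-mer strings to integers; k-mers missing from it count as 0.
    Returns an empty list if the read is shorter than k.
    """
    n_kmers = len(seq) - k + 1
    if n_kmers < 1:
        return []
    kmer_counts = [counts.get(seq[i:i + k], 0) for i in range(n_kmers)]
    coverage = []
    for pos in range(len(seq)):
        # k-mers starting at first..last (inclusive) all contain position pos.
        first = max(0, pos - k + 1)
        last = min(pos, n_kmers - 1)
        window = kmer_counts[first:last + 1]
        coverage.append(sum(window) / len(window))
    return coverage


# Toy example: the last k-mer is rare, so the final bases get low coverage.
toy_table = {"ACG": 10, "CGT": 10, "GTA": 1}
toy_coverage = per_base_kmer_coverage("ACGTA", toy_table, k=3)
```

Bases whose value falls well below the rest of the read would be the
candidates for sequencing errors. Where to put that threshold is left
open here.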

Peter


