[Biopython] Correcting short read errors based on k-mer coverage

Fri Sep 25 14:04:00 UTC 2009

Hi all,

This email was an offshoot of this thread on the Velvet user's list,
and Dan suggested I could CC the Biopython mailing list. See also:

http://listserver.ebi.ac.uk/pipermail/velvet-users/2009-September/000581.html

The method Dan describes looks like an interesting computational
challenge, but should be possible in (Bio)python...

Peter

---------- Forwarded message ----------
From: Dan Bolser <dan.bolser at gmail.com>
Date: Fri, Sep 25, 2009 at 2:39 PM
Subject: Re: [Velvet-users] Read length as a parameter?
To: peter at maubp.freeserve.co.uk
Cc: Daniel Zerbino <zerbino at ebi.ac.uk>

2009/9/25 Peter <peter at maubp.freeserve.co.uk>:
> On Fri, Sep 25, 2009 at 11:56 AM, Dan Bolser <dan.bolser at gmail.com> wrote:
>>
>> Hi Peter,
>>
>> Thanks for the examples.
>>
>> Since your clearly keen to show off the power of bioperl, here *is* a
>> challenge ;-)
>>
>> 1) Construct a k-mer frequency distribution from the set of quality
>> trimmed reads.
>> 2) Correct full length reads by making point mutations that change
>> 'rare' k-mers into 'common' k-mers.
>> 3) Re-trim reads according to kmer frequency after correction.
>>
>> 4) (For extra credit), implement step 2 and 3 but include homo-polymer
>> length variability (indels) in the set of allowed correction
>> operations.
>>
>> I really think 'code jamboree' could be a lot of fun (given the rate
>> of technology change).
>>
>> I'd be seriously impressed at any reasonable 'module' to crack the above!
>
> Hi Dan,
>
> Did you mean to send this off list?

I figured it wasn't really relevant to velvet mailing list, but please
feel free to cc biopython.

> BioPerl/Biopython jokes aside, right now I don't understand exactly
> what you are asking for - although with a little more background reading
> I could probably work it out. Presumably all of this is ignoring the
> FASTQ quality scores? e.g. it would be fine just to work with FASTA
> files? In step (2) you want to edit the reads (giving a new FASTA file)?
> What you want in step (3) is unclear.

Sorry, I took quite some short cuts in my description. Please see:

http://www.ncbi.nlm.nih.gov/pubmed/19056694

Step 1 uses quality to select high quality regions of reads. these
reads are broken down into k-mers (say of length 21), and then you
construct a k-mer frequency table. i.e. k-mer TATATATATATATATATATAT
occurs 5000 times in my read set. Here you need to consider memory
usage.

In step 2 you take the full reads (ignoring qualities) and look at the
k-mer frequency (average?) at each base. Some bases will have a very
low k-mer frequency, indicating sequencing errors.

Such bases can sometimes be unambiguously 'fixed' (changed to have
mean k-mer frequency) making single base substitutions.

Finally 'unfixable' reads (by the above definition) can be trimmed.

HTH,
Dan.