[Biopython] [Bioperl-l] Next-gen modules

Peter Cock p.j.a.cock at googlemail.com
Thu Jul 2 03:20:07 EDT 2009


On 7/1/09, Giles Weaver wrote:
> Peter, the trimming algorithm I use employs a sliding window, as follows:
>
>    - For each sequence position calculate the mean phred quality score for a
>    window around that position.
>    - Record whether the mean score is above or below a threshold as an array
>    of zeros and ones.
>    - Use a regular expression on the joined array to find the start and end
>    of the good quality sequence(s).
>    - Extract the quality sequence(s) and replace any bases below the quality
>    threshold with N.
>    - Trim any Ns from the ends.
>
> A refinement would be to weight the scores from positions in the window, but
> this could give a performance hit, and the method seems to work well enough
> as is.

Thanks for the details - that is a bit more complex that what I had been
thinking. Do you have any favoured window size and quality threshold,
or does this really depend on the data itself?

Also, if you find a sequence read that goes "good - poor - good" for
example, do you extract the two good regions as two sub reads
(presumably with a minimum length)? This may be silly for Illumina
where the reads are very short, but might make sense for Roche 454.

> Chris, thanks for committing the fix, I'll give bioperl illumina fastq
> parsing a workout soon. Peter, as much as I'd love to help out with
> biopython, I'm under too much time pressure right now!

Even use cases are useful - so thank you.

> Jonathan, some of the Illumina sequencing adapters are listed at
> http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland
> http://seqanswers.com/forums/showthread.php?t=198
> Adapter sequence typically appears towards the end of the read, though the
> latter part of it is often misread as the sequencing quality drops off.
> I abuse needle (EMBOSS) into aligning the adapter sequence with each read. I
> then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify
> real alignments and trim the sequence. This is not the ideal way of doing
> things, but it's fast enough, and does seem to work. The adapter sequence
> shouldn't be gapped, so I'm sure there is a lot of scope for optimising the
> adapter removal.
>
> I'll happily share some code once I've got it to the stage where I'm not
> embarrassed by it!
>
> Giles

Cheers,

Peter


More information about the Biopython mailing list