[Biopython] [Bioperl-l] Next-gen modules

Fri Jul 3 11:35:00 EDT 2009

Regarding the trimming algorithm, I've been using a window size of 5, a
minimum score of 20 and a minimum length of 15 with the Illumina data. In
the past I have used a similar algorithm with a larger window size and much
longer minimum length with sequence from ABI 3XXX machines. I imagine that
the ideal parameters for ABI SOLiD and Roche 454 would likely be similar to
those for Illumina and Sanger sequencing respectively.
Window size doesn't appear to affect performance much, if at all.

For sequences with multiple good regions, I do extract all good regions.
Even with the Illumina data there are sometimes two good regions, but
usually the second is adapter or junk and gets filtered out later. I haven't
seen quality data from a 454 machine recently, and would be interested to
know if multiple good regions are commonplace in 454 data. Can anyone with
access to 454 data comment on this?

Giles

2009/7/2 Peter Cock <p.j.a.cock at googlemail.com>

> On 7/1/09, Giles Weaver wrote:
> > Peter, the trimming algorithm I use employs a sliding window, as follows:
> >
> >    - For each sequence position calculate the mean phred quality score
> for a
> >    window around that position.
> >    - Record whether the mean score is above or below a threshold as an
> array
> >    of zeros and ones.
> >    - Use a regular expression on the joined array to find the start and
> end
> >    of the good quality sequence(s).
> >    - Extract the quality sequence(s) and replace any bases below the
> quality
> >    threshold with N.
> >    - Trim any Ns from the ends.
> >
> > A refinement would be to weight the scores from positions in the window,
> but
> > this could give a performance hit, and the method seems to work well
> enough
> > as is.
>
> Thanks for the details - that is a bit more complex that what I had been
> thinking. Do you have any favoured window size and quality threshold,
> or does this really depend on the data itself?
>
> Also, if you find a sequence read that goes "good - poor - good" for
> example, do you extract the two good regions as two sub reads
> (presumably with a minimum length)? This may be silly for Illumina
> where the reads are very short, but might make sense for Roche 454.
>
> > Chris, thanks for committing the fix, I'll give bioperl illumina fastq
> > parsing a workout soon. Peter, as much as I'd love to help out with
> > biopython, I'm under too much time pressure right now!
>
> Even use cases are useful - so thank you.
>
> > Jonathan, some of the Illumina sequencing adapters are listed at
> >
> http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland
> > http://seqanswers.com/forums/showthread.php?t=198
> > Adapter sequence typically appears towards the end of the read, though
> the
> > latter part of it is often misread as the sequencing quality drops off.
> > I abuse needle (EMBOSS) into aligning the adapter sequence with each
> read. I
> > then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify
> > real alignments and trim the sequence. This is not the ideal way of doing
> > things, but it's fast enough, and does seem to work. The adapter sequence
> > shouldn't be gapped, so I'm sure there is a lot of scope for optimising
> the
> > adapter removal.
> >
> > I'll happily share some code once I've got it to the stage where I'm not
> > embarrassed by it!
> >
> > Giles
>
> Cheers,
>
> Peter
>