[Biopython] still more questions about NGS sequence trimming

Peter Cock p.j.a.cock at googlemail.com
Thu Oct 25 15:58:04 UTC 2012


On Thu, Oct 25, 2012 at 4:34 PM, Kiss, Csaba <csaba.kiss at lanl.gov> wrote:
> I believe mothur checks the moving average quality of a sequence with
> a sliding window of 50 bp. If the quality falls below the given value then it
> tosses the sequence out. I don't think it does end trimming besides removing
> the lower case letters from the ends. Of course, it can remove adapter and
> primer sequences, but that's not based on quality values.

Fine - the point is that SeqIO.parse("example.sff", "sff-trim") does NOT do
any of that. All it does is apply the trimming information already recorded in
the SFF file by the provider (e.g. the Roche 454 instrument).

So back to your earlier question:

> On Thu, Oct 25, 2012 at 3:49 PM, Kiss, Csaba <csaba.kiss at lanl.gov> wrote:
>> Thanks, Peter. I am writing my quality functions. Another question
>> about trimming. As you mentioned, the quality of the ends tends to be
>> lower than in the middle. Could that be fixed just by using "sff-trim"
>> when I create my FASTQ file?

Using "sff-trim" would be sensible as a starting point, but you'll
still probably notice a drop off in quality along the read length.
This is normal.

>> If I don't do that I get sequences with small and capital letters.

The lower case bits are what Roche labelled as low quality or adapter.
The upper case bit is what Roche labelled as worth keeping after its
trimming, and it is this you'd get via SeqIO.parse("example.sff", "sff-trim").
You'll probably notice all the untrimmed sequences start with the same
four letters in lower case (the 454 key sequence, typically tcag).
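
To see the difference between the two formats for yourself, here is a
minimal sketch. The helper names uppercase_span and compare_trimming are
made up for illustration (they are not part of Biopython), and
"example.sff" stands in for your own file. It assumes the upper case
region of an untrimmed read is one contiguous block, which is how the
"sff" format presents the clipping.

```python
def uppercase_span(seq):
    """Return (start, end) slice bounds of the upper case region of seq.

    Returns (0, 0) if the sequence has no upper case letters at all.
    """
    text = str(seq)
    upper_positions = [i for i, letter in enumerate(text) if letter.isupper()]
    if not upper_positions:
        return 0, 0
    return upper_positions[0], upper_positions[-1] + 1


def compare_trimming(sff_path):
    """Yield (read id, untrimmed length, trimmed length) for each read."""
    # Imported here so uppercase_span can be used without Biopython installed.
    from Bio import SeqIO

    untrimmed = SeqIO.parse(sff_path, "sff")       # mixed case, full read
    trimmed = SeqIO.parse(sff_path, "sff-trim")    # upper case part only
    for full, clipped in zip(untrimmed, trimmed):
        yield full.id, len(full), len(clipped)


if __name__ == "__main__":
    import os

    if os.path.exists("example.sff"):  # placeholder filename
        for read_id, full_len, trimmed_len in compare_trimming("example.sff"):
            print(read_id, full_len, trimmed_len)
```

The untrimmed records also carry the clip positions in their annotations
(e.g. record.annotations["clip_qual_left"]) if you want the coordinates
rather than relying on the letter case.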

>> Are you suggesting further trimming than just "sff-trim"?

Yes, if you want to mimic what mothur was doing for you.
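
As a rough illustration only, here is a sliding window filter along the
lines you described (a 50 bp moving average, discarding any read whose
average drops below a cutoff). This is a sketch, not mothur's actual
implementation. The function names are hypothetical, and the default
threshold of 25 is an arbitrary placeholder you would need to choose
yourself.

```python
def min_window_mean(qualities, window=50):
    """Return the lowest mean PHRED score over any window of this size.

    Reads shorter than the window are treated as one single window.
    """
    if not qualities:
        raise ValueError("no quality scores given")
    size = min(window, len(qualities))
    total = sum(qualities[:size])
    lowest = total
    # Slide the window one base at a time, updating the running total.
    for i in range(size, len(qualities)):
        total += qualities[i] - qualities[i - size]
        lowest = min(lowest, total)
    return lowest / size


def passes_window_filter(qualities, window=50, threshold=25):
    """True if no window has a mean quality below the threshold."""
    return min_window_mean(qualities, window) >= threshold


def filter_records(records, window=50, threshold=25):
    """Yield only the records that pass the sliding window filter.

    Expects SeqRecord-like objects with PHRED scores stored under
    letter_annotations["phred_quality"], as SeqIO gives for SFF and FASTQ.
    """
    for record in records:
        quals = record.letter_annotations["phred_quality"]
        if quals and passes_window_filter(quals, window, threshold):
            yield record
```

You could then combine this with the Roche trimming, for example
SeqIO.write(filter_records(SeqIO.parse("example.sff", "sff-trim")),
"filtered.fastq", "fastq"), where the filenames are again placeholders.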

Peter


