[Biopython] [Bioperl-l] Next-gen modules

Wed Jul 1 08:35:14 EDT 2009

Peter,

I just committed a fix to FASTQ parsing last night to support read/ 
write for Sanger/Solexa/Illumina following the biopython convention;  
the only thing needed is more extensive testing for the quality  
scores.  There are a few other oddities with it I intend to address  
soon, but it appears to be working.

The Seq instance iterator actually calls a raw data iterator (hash  
refs of named arguments to the class constructor).  That should act as  
a decent filtering step if needed.

We have automated EMBOSS wrapping but I'm not sure how intuitive it  
is; we can probably reconfigure some of that.

chris

On Jul 1, 2009, at 2:44 AM, Peter Cock wrote:

> Hi all (BioPerl and Biopython),
>
> This is a continuation of a long thread on the BioPerl mailing
> list, which I have now CC'd to the Biopython mailing list. See:
> http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html
>
> On this thread we have been discussing next gen sequencing
> tools and co-coordinating things like consistent file format
> naming between Biopython, BioPerl and EMBOSS. I've been
> chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009,
> and he will look into setting up a cross project mailing list for
> this kind of discussion in future.
>
> In the mean time, my replies to Giles below cover both BioPerl
> and Biopython (and EMBOSS). Giles' original email is here:
> http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html
>
> Peter
>
> On 6/30/09, Giles Weaver <giles.weaver at googlemail.com> wrote:
>>
>> I'm developing a transcriptomics database for use with next-gen  
>> data, and
>> have found processing the raw data to be a big hurdle.
>>
>> I'm a bit late in responding to this thread, so most issues have  
>> already
>> been discussed. One thing that hasn't been mentioned is removal of  
>> adapters
>> from raw Illumina sequence. This is a PITA, and I'm not aware of  
>> any well
>> developed and documented open source software for removal of adapters
>> (and poor quality sequence) from Illumina reads.
>>
>> My current Illumina sequence processing pipeline is an unholy mix of
>> biopython, bioperl, pure perl, emboss and bowtie. Biopython for  
>> converting
>> the Illumina fastq to Sanger fastq, bioperl to read the quality  
>> values,
>> pure perl to trim the poor quality sequence from each read, and  
>> bioperl
>> with emboss to remove the adapter sequence. I'm aware that the  
>> pipeline
>> contains bugs and would like to simplify it, but at least it does  
>> work...
>>
>> Ideally I'd like to replace as much of the pipeline as possible with
>> bioperl/bioperl-run, but this isn't currently possible due to both  
>> a lack
>> of features and poor performance. I'm sure the features will come  
>> with
>> time, but the performance is more of a concern to me. ..
>
> I gather you would rather work with (Bio)Perl, but since you are
> already using Biopython to do the FASTQ conversion, you could
> also use it for more of your pipe line. Our tutorial includes examples
> of simple FASTQ quality filtering, and trimming of primer sequences
> (something like this might be helpful for removing adaptors). See:
> http://biopython.org/DIST/docs/tutorial/Tutorial.html
> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
>
> Alternatively, with the new release of EMBOSS this July, you will
> also be able to do the Illumina FASTQ to Sanger standard FASTQ
> with EMBOSS, and I'm sure BioPerl will offer this soon too.
>
>> Regarding trimming bad quality bases (see comments from
>> Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed
>> pure/bioperl solution to be much faster than a primarily bioperl
>> based implementation. I found Bio::Seq->subseq(a,b) and
>> Bio::Seq->subqual(a,b) to be far too slow. My current code trims
>> ~1300 sequences/second, including unzipping the raw data and
>> converting it to sanger fastq with biopython. Processing an entire
>> sequencing run with the whole pipeline takes in the region of 6-12h.
>
> There are several ways of doing quality trimming, and it would
> make an excellent cookbook example (both for BioPerl and
> Biopython).
>
> Could you go into a bit more detail about your trimming
> algorithm? e.g. Do you just trim any bases on the right below
> a certain threshold, perhaps with a minimum length to retain
> the trimmed read afterwards?
>
>> Hope this looooong post was of interest to someone!
>
> I was interested at least ;)
>
> Peter
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l