[Biopython] [Bioperl-l] Next-gen modules

Wed Jul 1 07:44:12 UTC 2009

Hi all (BioPerl and Biopython),

This is a continuation of a long thread on the BioPerl mailing
list, which I have now CC'd to the Biopython mailing list. See:
http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html

On this thread we have been discussing next gen sequencing
tools and coordinating things like consistent file format
naming between Biopython, BioPerl and EMBOSS. I've been
chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009,
and he will look into setting up a cross-project mailing list for
this kind of discussion in future.

In the meantime, my replies to Giles below cover both BioPerl
and Biopython (and EMBOSS). Giles' original email is here:
http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html

Peter

On 6/30/09, Giles Weaver <giles.weaver at googlemail.com> wrote:
>
> I'm developing a transcriptomics database for use with next-gen data, and
> have found processing the raw data to be a big hurdle.
>
> I'm a bit late in responding to this thread, so most issues have already
> been discussed. One thing that hasn't been mentioned is removal of adapters
> from raw Illumina sequence. This is a PITA, and I'm not aware of any well
> developed and documented open source software for removal of adapters
> (and poor quality sequence) from Illumina reads.
>
> My current Illumina sequence processing pipeline is an unholy mix of
> biopython, bioperl, pure perl, emboss and bowtie. Biopython for converting
> the Illumina fastq to Sanger fastq, bioperl to read the quality values,
> pure perl to trim the poor quality sequence from each read, and bioperl
> with emboss to remove the adapter sequence. I'm aware that the pipeline
> contains bugs and would like to simplify it, but at least it does work...
>
> Ideally I'd like to replace as much of the pipeline as possible with
> bioperl/bioperl-run, but this isn't currently possible due to both a lack
> of features and poor performance. I'm sure the features will come with
> time, but the performance is more of a concern to me. ..

I gather you would rather work with (Bio)Perl, but since you are
already using Biopython to do the FASTQ conversion, you could
also use it for more of your pipeline. Our tutorial includes examples
of simple FASTQ quality filtering, and trimming of primer sequences
(something like this might be helpful for removing adapters). See:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
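
To give a flavour of the adapter removal idea, here is a rough sketch in
plain Python (no Biopython needed, so it is not the tutorial code). The
function name trim_adapter, the min_overlap parameter and the adapter
string in the usage note are all made up for illustration. It only does
exact matching, so adapters containing sequencing errors will be missed:

```python
def trim_adapter(seq, qual, adapter, min_overlap=5):
    """Remove a 3' adapter (exact matches only) from a read.

    seq and qual are the read sequence and its per-base qualities
    (string or list) of equal length. Returns the trimmed pair.
    """
    # First look for the complete adapter anywhere in the read.
    index = seq.find(adapter)
    if index == -1:
        # Otherwise look for a partial adapter running off the 3' end,
        # i.e. the read ends with a prefix of the adapter. Try the
        # longest possible overlap first.
        max_overlap = min(len(adapter) - 1, len(seq))
        for overlap in range(max_overlap, min_overlap - 1, -1):
            if seq.endswith(adapter[:overlap]):
                index = len(seq) - overlap
                break
    if index == -1:
        # No adapter found, return the read unchanged.
        return seq, qual
    return seq[:index], qual[:index]
```

For example, with a placeholder adapter "GATCGGAAGAGC", calling
trim_adapter("ACGTACGTTTGATCGG", "I" * 16, "GATCGGAAGAGC") should cut
the read back to its first ten bases. A real pipeline would probably
want to tolerate mismatches, which this sketch does not attempt.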

Alternatively, with the new release of EMBOSS this July, you will
also be able to do the Illumina FASTQ to Sanger standard FASTQ
conversion with EMBOSS, and I'm sure BioPerl will offer this soon
too.
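
For anyone wondering what that conversion involves, here is a minimal
sketch in plain Python. It assumes the Illumina 1.3+ style FASTQ where
the quality string holds PHRED scores with an ASCII offset of 64, and
re-encodes them with the Sanger offset of 33. Older Solexa files use a
different score scale and need an extra mapping step, which is not
covered here. The function name is just for illustration:

```python
def illumina_to_sanger(qual):
    """Re-encode a PHRED+64 quality string as PHRED+33 (Sanger).

    Assumes Illumina 1.3+ qualities (PHRED scores, offset 64),
    not the older Solexa score scale.
    """
    converted = []
    for char in qual:
        phred = ord(char) - 64
        if not 0 <= phred <= 62:
            # Outside the range a PHRED+64 string can hold in
            # printable ASCII, so probably the wrong FASTQ variant.
            raise ValueError("Unexpected quality character %r" % char)
        converted.append(chr(phred + 33))
    return "".join(converted)
```

Only the quality line of each FASTQ record changes, the identifier
and sequence lines are left alone.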

> Regarding trimming bad quality bases (see comments from
> Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed
> pure/bioperl solution to be much faster than a primarily bioperl
> based implementation. I found Bio::Seq->subseq(a,b) and
> Bio::Seq->subqual(a,b) to be far too slow. My current code trims
> ~1300 sequences/second, including unzipping the raw data and
> converting it to sanger fastq with biopython. Processing an entire
> sequencing run with the whole pipeline takes in the region of 6-12h.

There are several ways of doing quality trimming, and it would
make an excellent cookbook example (both for BioPerl and
Biopython).

Could you go into a bit more detail about your trimming
algorithm? e.g. Do you just trim any bases on the right below
a certain threshold, perhaps with a minimum length to retain
the trimmed read afterwards?
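
To make the question concrete, the simplest scheme I have in mind
looks something like the sketch below (plain Python, and only my guess
at one possible approach, not a description of your code). The function
name and the default threshold and minimum length are arbitrary
choices for illustration:

```python
def trim_right(seq, quals, threshold=20, min_length=25):
    """Trim low quality bases from the 3' end of a read.

    quals is a list of integer PHRED scores, one per base.
    Returns (seq, quals) after trimming, or None if the trimmed
    read is shorter than min_length and should be discarded.
    """
    end = len(quals)
    # Walk in from the right while the bases are below the threshold.
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    if end < min_length:
        return None
    return seq[:end], quals[:end]
```

Note this stops at the first good base from the right, so a poor
base in the middle of the read is kept. A sliding window average
would be one alternative.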

> Hope this looooong post was of interest to someone!

I was interested at least ;)

Peter