[Bioperl-l] Next-gen modules

Tue Jun 30 07:28:25 EDT 2009

I'm developing a transcriptomics database for use with next-gen data, and
have found processing the raw data to be a big hurdle.

I'm a bit late in responding to this thread, so most issues have already
been discussed. One thing that hasn't been mentioned is removal of adapters
from raw Illumina sequence. This is a PITA, and I'm not aware of any well
developed and documented open source software for removal of adapters (and
poor quality sequence) from Illumina reads.
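
To make the task concrete, here is a minimal sketch of 3' adapter clipping in pure Python. It is an illustration only, not the method used in the pipeline described in this post. It assumes exact matching (no mismatches or indels), and the function name `clip_adapter`, the `min_overlap` parameter and the adapter string in the usage note are placeholders invented for the example.

```python
def clip_adapter(seq: str, adapter: str, min_overlap: int = 3) -> str:
    """Return seq with a 3' adapter (complete or partial) removed.

    Exact matching only: a real tool would also tolerate sequencing errors.
    """
    # Case 1: the complete adapter occurs in the read; cut at its first occurrence.
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos]

    # Case 2: the adapter runs off the end of the read. Look for the longest
    # adapter prefix (at least min_overlap bases) that is a suffix of the read.
    for k in range(min(len(adapter) - 1, len(seq)), min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:len(seq) - k]

    # No adapter found: return the read unchanged.
    return seq
```

For example, `clip_adapter("ACGTACGTAGATC", "AGATCGGAAG")` removes the trailing `AGATC`. The quality string has to be shortened to the same length, e.g. `qual[:len(clipped)]`.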

My current Illumina sequence processing pipeline is an unholy mix of
biopython, bioperl, pure perl, emboss and bowtie. Biopython for converting
the Illumina fastq to Sanger fastq, bioperl to read the quality values, pure
perl to trim the poor quality sequence from each read, and bioperl with
emboss to remove the adapter sequence. I'm aware that the pipeline contains
bugs and would like to simplify it, but at least it does work...
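
As a rough sketch of the first step (Illumina fastq to Sanger fastq), the conversion comes down to shifting the quality characters. This assumes Illumina 1.3+ files, where qualities are Phred scores stored with an ASCII offset of 64. Older Solexa files use a different score scale and need a non-linear mapping that is not shown here. The function names are made up for the example, and the input is assumed to be strict four-line FASTQ records.

```python
def illumina_to_sanger(qual: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger)."""
    out = []
    for ch in qual:
        q = ord(ch) - 64  # Phred score under the assumed Illumina 1.3+ encoding
        if q < 0:
            raise ValueError(f"character {ch!r} is not valid Phred+64")
        out.append(chr(q + 33))
    return "".join(out)


def convert_fastq(lines):
    """Yield converted lines from an iterable of four-line FASTQ records."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # Every fourth line (index 3, 7, 11, ...) holds the quality string.
        yield illumina_to_sanger(line) if i % 4 == 3 else line
```

Because `convert_fastq` is a generator, it can be fed an open (or gzip-opened) file handle and processes one line at a time rather than loading the whole run into memory.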

Ideally I'd like to replace as much of the pipeline as possible with
bioperl/bioperl-run, but this isn't currently possible due to both a lack of
features and poor performance. I'm sure the features will come with time,
but the performance is more of a concern to me. I wonder if Bio::Moose might
be used to alleviate some of the performance issues? Might next-gen modules
be an ideal guinea pig for Bio::Moose?

For my purposes the tools that I would love to see supported in
bioperl/bioperl-run are:

   - next-gen sequence quality parsing (to output phred scores)
   - sequence quality based trimming
   - sequencing adapter removal
   - filtering based on sequence complexity (repeats, entropy etc)
   - bioperl-run modules for bowtie etc.

Obviously all of these need to be fast!
I'd love to muck in, but I doubt I'll contribute much before
Bio::Moose/bioperl6, as the (bio)perl object system gives me nightmares!
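
To illustrate the complexity-filter item in the list above, here is a small sketch based on Shannon entropy over single-base frequencies. It is one possible definition of "complexity" among several, and the names and the `min_entropy` threshold of 1.0 bits are arbitrary choices for the example, not recommended values.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits per base: 0.0 for a homopolymer,
    2.0 when A, C, G and T are equally frequent."""
    if not seq:
        return 0.0
    n = len(seq)
    # Sum of p * log2(1/p) over the observed bases, with p = count / n.
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())


def is_low_complexity(seq: str, min_entropy: float = 1.0) -> bool:
    """True if the read falls below the (assumed) entropy threshold."""
    return shannon_entropy(seq) < min_entropy
```

Single-base entropy does not catch short tandem repeats such as `ACACACAC`; computing the same measure over dinucleotides or longer k-mers would be the obvious extension.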

Regarding trimming bad quality bases (see comments from Tristan Lefebure)
from Solexa/Illumina reads, I did find a mixed pure/bioperl solution to be
much faster than a primarily bioperl based implementation. I found
Bio::Seq->subseq(a,b) and Bio::Seq->subqual(a,b) to be far too slow. My
current code trims ~1300 sequences/second, including unzipping the raw data
and converting it to Sanger fastq with biopython. Processing an entire
sequencing run with the whole pipeline takes in the region of 6-12h.
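
For comparison, a lightweight trimming step that avoids building sequence objects might look like the sketch below. It is not the code from the pipeline described here. It assumes Sanger-encoded (Phred+33) quality strings and trims only the 3' end, and the `min_q` threshold of 20 is an arbitrary example value.

```python
def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33):
    """Drop trailing bases whose Phred score is below min_q.

    Works on the raw strings: quality characters are compared directly
    against the cutoff character, so no per-base integer list is built.
    """
    cutoff = chr(min_q + offset)
    end = len(qual)
    # Walk back from the 3' end while the base quality is below the cutoff.
    while end > 0 and qual[end - 1] < cutoff:
        end -= 1
    return seq[:end], qual[:end]
```

With the defaults, `trim_3prime("ACGTAC", "IIII#!")` returns `("ACGT", "IIII")`; a read whose bases are all below the cutoff comes back as two empty strings and can then be discarded.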

Hope this looooong post was of interest to someone!

Giles

2009/6/17 Tristan Lefebure <tristan.lefebure at gmail.com>

> Hello,
> Regarding next-gen sequences and bioperl, following my
> experience, another issue is bioperl speed. For example, if
> you want to trim bad quality bases at ends of 1E6 Solexa
> reads using Bio::SeqIO::fastq and some methods in
> Bio::Seq::Quality, well, you've got to be patient (but maybe
> I missed some shortcuts...).
>
> A pure perl solution will be between 100 and 1000x faster...
> Would it be possible to have an ultra-light quality object
> with few simple methods for next-gen reads?
>
> I can contribute some tests if that sounds like an important
> point.
>
> -Tristan
>