[Bioperl-l] Next-gen modules

Tue Jun 30 13:46:27 EDT 2009

On Jun 30, 2009, at 6:28 AM, Giles Weaver wrote:

> I'm developing a transcriptomics database for use with next-gen  
> data, and
> have found processing the raw data to be a big hurdle.
>
> I'm a bit late in responding to this thread, so most issues have  
> already
> been discussed. One thing that hasn't been mentioned is removal of  
> adapters
> from raw Illumina sequence. This is a PITA, and I'm not aware of any  
> well
> developed and documented open source software for removal of  
> adapters (and
> poor quality sequence) from Illumina reads.
>
> My current Illumina sequence processing pipeline is an unholy mix of
> biopython, bioperl, pure perl, emboss and bowtie. Biopython for  
> converting
> the Illumina fastq to Sanger fastq, bioperl to read the quality  
> values, pure
> perl to trim the poor quality sequence from each read, and bioperl  
> with
> emboss to remove the adapter sequence. I'm aware that the pipeline  
> contains
> bugs and would like to simplify it, but at least it does work...

My local bioperl is working with FASTQ parsing of Sanger and Illumina  
(but not solexa yet).  I'll commit what I have today, and we should be  
able to add in solexa soon.  We'll also need to add in write_seq  
support.

> Ideally I'd like to replace as much of the pipeline as possible with
> bioperl/bioperl-run, but this isn't currently possible due to both a  
> lack of
> features and poor performance. I'm sure the features will come with  
> time,
> but the performance is more of a concern to me. I wonder if  
> Bio::Moose might
> be used to alleviate some of the performance issues? Might next-gen  
> modules
> be an ideal guinea pig for Bio::Moose?

We should get FASTQ working in core first then optimize on speed (as  
Elia previously pointed out).  We can do that within the actual SeqIO  
parser using a few simple tricks. For instance my local  
Bio::SeqIO::fastq has a reconfigured next_seq to call an iterator that  
returns raw processed data as a simple hash ref; users have access to  
that method, so if one wanted they could retrieve the raw data  
directly, or pass it through a filter that only creates seq instances  
one wants on the fly (that would be where your quality checks, adaptor  
modification, etc. fit in).

In the end it might be to wrap a C/C++-based solution for speed.  As  
mentioned previously a C-based parser exists from Sanger Centre that  
we could incorporate in some fashion, but I would like if it were able  
to report back file position for fast indexing.  The code is fairly  
simple so it should be too hard to incorporate that in somehow.

Just so there is no confusion, Bio::Moose is an attempt to both lay  
out plans for perl6 and deal with inheritance issues within bioperl  
now. It's still in very early development and may not see a release  
until Dec. at the very earliest, it will be an alpha release then, and  
likely won't have every major class represented at that point.  It's  
also not intended to be backwards-compatible with bioperl core.  It  
may help, but that's not an absolute certainty.  As for bioperl6, it  
will be pre-alpha until perl6 spec reaches a stable draft and we have  
an active implementation.

> For my purposes the tools that would love to see supported in
> bioperl/bioperl-run are:
>
>   - next-gen sequence quality parsing (to output phred scores)
>   - sequence quality based trimming
>   - sequencing adapter removal
>   - filtering based on sequence complexity (repeats, entropy etc)
>   - bioperl-run modules for bowtie etc.
>
> Obviously all of these need to be fast!
> I'd love to muck in, but I doubt I'll contribute much before
> Bio::Moose/bioperl6, as the (bio)perl object system gives me  
> nightmares!

One can only read a file so fast (even with a highly optimized C/C++  
based parser), but I don't think that will be the limiting factor as  
much as object instantiation.

> Regarding trimming bad quality bases (see comments from Tristan  
> Lefebure)
> from Solexa/Illumina reads, I did find a mixed pure/bioperl solution  
> to be
> much faster than a primarily bioperl based implementation. I found
> Bio::Seq->subseq(a,b) and Bio::Seq->subqual(a,b) to be far too slow.  
> My
> current code trims ~1300 sequences/second, including unzipping the  
> raw data
> and converting it to sanger fastq with biopython. Processing an entire
> sequencing run with the whole pipeline takes in the region of 6-12h.

Right, hence coming up with a 'pre-filter' for raw data (hash refs)  
prior to object instantiation to speed things up.  This will be a bit  
easier with Bio::Moose as we can introspect attributes via the meta  
class, but this will be a while yet.

> Hope this looooong post was of interest to someone!
>
> Giles

It's always good to hear about such issues and what one expects.

chris