[Bioperl-l] Next-gen modules
Jonathan Epstein
Jonathan_Epstein at nih.gov
Wed Jul 1 09:20:50 EDT 2009
I too am interested in these topics. In particular, I would like to
learn more about "sequencing adapter removal," i.e. what these adapters
look like, and what strategies you've employed for finding and removing
them.
Jonathan
Giles Weaver wrote:
> I'm developing a transcriptomics database for use with next-gen data, and
> have found processing the raw data to be a big hurdle.
>
> I'm a bit late in responding to this thread, so most issues have already
> been discussed. One thing that hasn't been mentioned is removal of adapters
> from raw Illumina sequence. This is a PITA, and I'm not aware of any well
> developed and documented open source software for removal of adapters (and
> poor quality sequence) from Illumina reads.
>
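
To make the adapter problem concrete, here is a minimal Python sketch of one possible strategy. It assumes the adapter sits at the 3' end of the read, possibly cut off by the end of the read, and that exact matching is good enough. The adapter string is a made-up placeholder, not a real Illumina adapter, and trim_adapter and min_overlap are hypothetical names. This is not the pipeline described above, and real reads would almost certainly need some tolerance for mismatches.

```python
ADAPTER = "ACGTTGCAAGCT"  # placeholder sequence, NOT a real adapter


def trim_adapter(seq, qual, adapter=ADAPTER, min_overlap=3):
    """Remove a 3' adapter (complete or truncated) using exact matching.

    Returns the trimmed (seq, qual) pair; both strings stay the same length.
    """
    # Case 1: the complete adapter occurs somewhere in the read.
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos], qual[:pos]
    # Case 2: the read ends part-way through the adapter, so only a
    # prefix of the adapter is present at the 3' end. Try the longest
    # possible prefix first and stop at min_overlap bases.
    for k in range(min(len(adapter) - 1, len(seq)), min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:-k], qual[:-k]
    # No adapter found: return the read unchanged.
    return seq, qual
```

A small min_overlap removes more adapter fragments but also clips genuine sequence that happens to match the first few adapter bases, so the value of 3 used here is only an illustrative default.
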
> My current Illumina sequence processing pipeline is an unholy mix of
> biopython, bioperl, pure perl, emboss and bowtie. Biopython for converting
> the Illumina fastq to Sanger fastq, bioperl to read the quality values, pure
> perl to trim the poor quality sequence from each read, and bioperl with
> emboss to remove the adapter sequence. I'm aware that the pipeline contains
> bugs and would like to simplify it, but at least it does work...
>
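
The first step of that pipeline, converting Illumina fastq to Sanger fastq, can be sketched in a few lines of plain Python. The sketch assumes the Illumina file uses Phred scores with an ASCII offset of 64 (as in Illumina pipeline 1.3 and later). Older Solexa files store log-odds scores and need a different conversion that is not shown here. The function names are hypothetical.

```python
def illumina_to_sanger(qual):
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger)."""
    # Each character moves down by 64 - 33 = 31 code points.
    return "".join(chr(ord(c) - 64 + 33) for c in qual)


def phred_scores(qual, offset=33):
    """Decode a quality string into a list of integer Phred scores."""
    return [ord(c) - offset for c in qual]
```

For example, the Illumina quality string "hhB@" becomes "II#!" in Sanger encoding, which decodes to the Phred scores 40, 40, 2 and 0.
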
> Ideally I'd like to replace as much of the pipeline as possible with
> bioperl/bioperl-run, but this isn't currently possible due to both a lack of
> features and poor performance. I'm sure the features will come with time,
> but the performance is more of a concern to me. I wonder if Bio::Moose might
> be used to alleviate some of the performance issues? Might next-gen modules
> be an ideal guinea pig for Bio::Moose?
>
> For my purposes the tools that I would love to see supported in
> bioperl/bioperl-run are:
>
> - next-gen sequence quality parsing (to output phred scores)
> - sequence quality based trimming
> - sequencing adapter removal
> - filtering based on sequence complexity (repeats, entropy etc)
> - bioperl-run modules for bowtie etc.
>
> Obviously all of these need to be fast!
> I'd love to muck in, but I doubt I'll contribute much before
> Bio::Moose/bioperl6, as the (bio)perl object system gives me nightmares!
>
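
For the complexity filter in the wish list above, one simple measure is the Shannon entropy of the base composition. The Python sketch below is only an illustration: the function names and the 1.0 bit threshold are assumptions, and a composition-based measure will not catch every repeat (a dinucleotide repeat such as ACACAC still scores 1.0 bit).

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy of the base composition, in bits per base."""
    if not seq:
        return 0.0
    n = len(seq)
    # Sum p * log2(p) over the observed bases, then negate.
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def is_low_complexity(seq, min_entropy=1.0):
    """Flag reads whose composition entropy is below an (arbitrary) threshold."""
    return shannon_entropy(seq) < min_entropy
```

A homopolymer read scores 0 bits and a read with equal amounts of all four bases scores 2 bits, so any useful threshold lies between those two values.
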
> Regarding trimming bad quality bases from Solexa/Illumina reads (see comments
> from Tristan Lefebure), I found a mixed pure perl/bioperl solution to be
> much faster than a primarily bioperl based implementation. I found
> Bio::Seq->subseq(a,b) and Bio::Seq->subqual(a,b) to be far too slow. My
> current code trims ~1300 sequences/second, including unzipping the raw data
> and converting it to Sanger fastq with biopython. Processing an entire
> sequencing run with the whole pipeline takes in the region of 6-12h.
>
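
As an illustration of quality trimming done with plain string operations rather than sequence objects, here is a minimal Python sketch. It assumes Sanger (Phred+33) quality strings and simply strips bases below a threshold from both ends of the read. The function name and the default threshold of 20 are assumptions, and this is not the trimming code described above.

```python
def trim_low_quality(seq, qual, min_q=20, offset=33):
    """Strip bases with Phred score < min_q from both ends of a read."""
    # Compare quality characters directly; this avoids decoding every
    # score and works because ASCII order matches score order.
    cutoff = chr(min_q + offset)
    start, end = 0, len(qual)
    # Walk in from the 3' end, then from the 5' end.
    while end > start and qual[end - 1] < cutoff:
        end -= 1
    while start < end and qual[start] < cutoff:
        start += 1
    return seq[start:end], qual[start:end]
```

A read whose bases all fall below the threshold comes back as a pair of empty strings, so callers will probably want to apply a minimum length filter afterwards.
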
> Hope this looooong post was of interest to someone!
>
> Giles
>
> 2009/6/17 Tristan Lefebure <tristan.lefebure at gmail.com>
>
>
>> Hello,
>> Regarding next-gen sequences and bioperl, in my
>> experience another issue is bioperl speed. For example, if
>> you want to trim bad quality bases at the ends of 1E6 Solexa
>> reads using Bio::SeqIO::fastq and some methods in
>> Bio::Seq::Quality, well, you've got to be patient (but
>> maybe I missed some shortcuts...).
>>
>> A pure perl solution will be between 100 and 1000x faster...
>> Would it be possible to have an ultra-light quality object
>> with few simple methods for next-gen reads?
>>
>> I can contribute some tests if that sounds like an important
>> point.
>>
>> -Tristan
>>
>>
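
One way to picture an "ultra-light" representation of next-gen reads is to skip objects altogether and pass plain tuples around. The Python sketch below is only an illustration of that idea, not an existing bioperl or biopython API. It assumes strict four-line fastq records (no wrapped sequence lines), and read_fastq is a hypothetical name.

```python
import io


def read_fastq(handle):
    """Yield (name, seq, qual) tuples from a four-line-per-record fastq stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed fastq record near %r" % header)
        yield header[1:].rstrip("\n"), seq, qual


# Usage: iterate over an in-memory example instead of a real file.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGG\n+r2\n##\n")
records = list(read_fastq(example))
```

Quality scores stay as raw strings until something actually needs numbers, which keeps the per-read overhead low.
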
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>