[Bioperl-l] Next-gen modules

Tue Jun 30 21:58:57 UTC 2009

All,

I have committed a first pass at Illumina/Solexa parsing for FASTQ,
along with tests.  It's quite possible the quality scores are off,
particularly for Solexa (Illumina 1.0), so test away and let me know
if anything pops up (it should be a quick fix).  Along with that is a
small commit to Bio::SeqIO so that we can add format variants (see
below for an example).  write_seq/write_qual/write_fastq will likely
not work as expected, as I haven't touched them yet; they are to be
tackled next.
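
For anyone checking the Solexa scores by hand, here is a rough sketch of the usual conversion. It is not the code that was committed. It assumes the commonly documented conventions: Solexa/Illumina 1.0 characters use an ASCII offset of 64, and a Solexa score maps to Phred as Q_phred = 10 * log10(10^(Q_solexa/10) + 1). The sub names are made up for illustration.

```perl
use strict;
use warnings;

# Decode one Solexa (Illumina 1.0) quality character.
# Assumption: ASCII offset 64, so scores may be negative (down to -5).
sub solexa_char_to_score {
    my ($char) = @_;
    return ord($char) - 64;
}

# Convert a Solexa score to the equivalent Phred score:
#   Q_phred = 10 * log10(10^(Q_solexa/10) + 1)
sub solexa_to_phred {
    my ($sq) = @_;
    return 10 * log(10 ** ($sq / 10) + 1) / log(10);
}

# Decode a whole quality string into Phred scores rounded to integers.
sub solexa_string_to_phred {
    my ($qual) = @_;
    return map { sprintf('%.0f', solexa_to_phred(solexa_char_to_score($_))) }
           split //, $qual;
}

print join(' ', solexa_string_to_phred('hhZ;')), "\n";
```

The two scales nearly coincide at high quality and diverge at the low end, so any discrepancy in the parser should be most visible on low-quality bases.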

For faster parsing I have also added a next_dataset method that
returns a hash reference to the parsed data instead of an object; this
hash includes the quality scores.  next_seq calls this method and
passes the relevant data directly to the sequence factory, so one
could do something like the following to filter sequences as needed:

use Modern::Perl;
use Bio::SeqIO;
use Bio::Seq::SeqFactory;

my $file = shift;

# same as (-format   => 'fastq', -variant => 'illumina')
my $in = Bio::SeqIO->new(-file     => $file,
                          -format   => 'fastq-illumina');

my $factory = Bio::Seq::SeqFactory->new(-type => 'Bio::Seq::Quality');

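while (my $data = $in->next_dataset) {
    # skip unwanted records before paying for object creation
    next if seq_is_crap($data);
    my $seq = $factory->create(%$data);
}

# return true to skip a record; $data is the hash ref from next_dataset
sub seq_is_crap {
    my ($data) = @_;
    # filter here
    return 0;
}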
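
As a rough idea of what such a filter could look like, here is a minimal sketch that rejects short reads and reads with a low mean quality. The sub name low_quality, the thresholds, and the hash keys -seq and -qual (with -qual holding an array reference of numeric scores) are assumptions for illustration; check what next_dataset actually returns before relying on them. The records below are stand-ins so the snippet runs without BioPerl.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical filter: returns true when the record should be skipped.
# Assumes $data->{-seq} is the sequence string and $data->{-qual} is an
# array ref of numeric quality scores (both key names are assumptions).
sub low_quality {
    my ($data, $min_mean, $min_len) = @_;
    my $qual = $data->{-qual} || [];
    return 1 if length($data->{-seq} // '') < $min_len;  # too short
    return 1 unless @$qual;                              # no scores at all
    return (sum(@$qual) / @$qual) < $min_mean ? 1 : 0;   # mean too low
}

# Stand-in records in place of real next_dataset output.
my @records = (
    { -id => 'r1', -seq => 'ACGTACGT', -qual => [30, 32, 31, 29, 30, 33, 31, 30] },
    { -id => 'r2', -seq => 'ACGTACGT', -qual => [5, 4, 6, 3, 5, 4, 6, 5] },
    { -id => 'r3', -seq => 'ACG',      -qual => [35, 35, 35] },
);

my @kept = grep { !low_quality($_, 20, 5) } @records;
print join(',', map { $_->{-id} } @kept), "\n";
```

Filtering on the plain hash means rejected reads never reach the sequence factory, which is where the time savings would come from.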

chris