[Bioperl-l] Next-gen modules
Chris Fields
cjfields at illinois.edu
Tue Jun 30 21:58:57 UTC 2009
All,
I have committed a first pass at Illumina/Solexa parsing for FASTQ,
along with tests. It's very possible the quality scores are off,
particularly for Solexa (Illumina 1.0), so test away and let me know
if anything pops up (should be a quick fix). Along with that is a
small commit to Bio::SeqIO so that we can add format variants (see
below for an example). write_seq/write_qual/write_fastq will likely
not work as expected, as I haven't touched them yet; they will be
tackled next.
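For anyone cross-checking the scores, here is a minimal standalone sketch of the commonly documented encodings: Sanger is Phred + 33, Illumina 1.3+ is Phred + 64, and Solexa (Illumina 1.0) stores a log-odds score at offset 64 that has to be mapped onto the Phred scale. This is not the code in the committed parser; the names %OFFSET, solexa_to_phred and decode_quality are made up for illustration.

```perl
use strict;
use warnings;

# ASCII offsets per variant (commonly documented values, not taken
# from the committed parser).
my %OFFSET = (sanger => 33, illumina => 64, solexa => 64);

# Solexa scores are log-odds based; map them onto the Phred scale:
#   Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1)
sub solexa_to_phred {
    my ($q) = @_;
    return 10 * log(10 ** ($q / 10) + 1) / log(10);
}

# Decode a FASTQ quality string into a list of integer Phred scores.
sub decode_quality {
    my ($qual_string, $variant) = @_;
    my $offset = $OFFSET{$variant};
    die "unknown variant '$variant'" unless defined $offset;

    my @scores = map { ord($_) - $offset } split //, $qual_string;

    # Only the Solexa variant needs the extra conversion step; the
    # result is always positive, so int(x + 0.5) rounds to nearest.
    if ($variant eq 'solexa') {
        @scores = map { int(solexa_to_phred($_) + 0.5) } @scores;
    }
    return @scores;
}
```

With this sketch, decode_quality('I', 'sanger') and decode_quality('h', 'illumina') both give 40, while the lowest Solexa character ';' (score -5) comes out as Phred 1. Comparing a few reads against values like these should show quickly whether an offset or the Solexa conversion is off.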
For faster parsing I have also added a next_dataset method that
returns a hash reference to the parsed data instead of an object; this
hash includes the quality scores. next_seq calls this method and
passes the relevant data directly to the sequence factory; one could
do something like the following to filter sequences as needed:

use Modern::Perl;
use Bio::SeqIO;
use Bio::Seq::SeqFactory;

my $file = shift;

# same as (-format => 'fastq', -variant => 'illumina')
my $in = Bio::SeqIO->new(-file   => $file,
                         -format => 'fastq-illumina');

my $factory = Bio::Seq::SeqFactory->new(-type => 'Bio::Seq::Quality');

while (my $data = $in->next_dataset) {
    # skip unwanted records before an object is ever created
    next if seq_is_crap($data);
    my $seq = $factory->create(%$data);
}

sub seq_is_crap { # filter here; return true to skip the record
}

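As a rough idea of what such a filter might look like, here is a sketch that rejects short reads, reads containing N, and reads with a low mean quality. It assumes the hash uses '-seq' (a string) and '-qual' (an array reference of integer scores) as keys, which should be checked against what next_dataset actually returns; the name is_low_quality and both thresholds are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical thresholds; tune them for your data.
my $MIN_LENGTH    = 20;
my $MIN_MEAN_QUAL = 20;

# Returns 1 when the record should be skipped, 0 otherwise.
# ASSUMPTION: '-seq' holds the sequence string and '-qual' an array
# reference of integer quality scores.
sub is_low_quality {
    my ($data) = @_;
    my $seq  = $data->{-seq};
    my $qual = $data->{-qual};

    # Treat incomplete records as unusable.
    return 1 unless defined $seq && ref $qual eq 'ARRAY' && @$qual;

    return 1 if length($seq) < $MIN_LENGTH;   # too short
    return 1 if $seq =~ /N/i;                 # ambiguous base call

    my $mean = sum(@$qual) / scalar @$qual;   # mean quality score
    return $mean < $MIN_MEAN_QUAL ? 1 : 0;
}
```

Since this only looks at the raw hash, no sequence object is built for the reads that get thrown away.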
chris