[Bioperl-l] Parsing FASTA files into PrimarySeq objects
Florent Angly
florent.angly at gmail.com
Tue Feb 16 06:18:16 UTC 2010
Thanks for your reply Jason. It's very valuable since you coded the
modules in question!
Now that I understand the purpose of these modules, I did a little bit
of documentation clarification. I also figured out that I should
implement the 'type' method in Bio::Seq::SeqFastaSpeedFactory and made
it support creating Bio::PrimarySeq. This is all committed to SVN.
When sequences are large, you are right that it does not make much
difference in time or memory resource to use one sequence type over the
other. However, that's different for short sequences. I took a FASTA
file weighting 39.3 MB and containing 360,000 sequences of 106 bp on
average.
CASE 1:
Requested factory Bio::Seq::SeqFactory
Requested sequence type is Bio::Seq
Memory used: 329 MB
Time spent: 1m8.522s
CASE 2:
Requested factory Bio::Seq::SeqFactory
Requested sequence type is Bio::PrimarySeq
Memory used: 246 MB
Time spent: 0m52.663s
CASE 3:
Requested factory Bio::Seq::SeqFastaSpeedFactory
Requested sequence type is Bio::Seq
Memory used: 298 MB
Time spent: 0m31.292s
CASE 4:
Requested factory Bio::Seq::SeqFastaSpeedFactory
Requested sequence type is Bio::PrimarySeq
Memory used: 231 MB
Time spent: 0m28.500s
Clearly, when only simple sequence attributes are needed or FASTA files
are used, it's fast and memory efficient to use the SeqFastaSpeedFactory
with PrimarySeq.
Florent
On 16/02/10 10:03, Jason Stajich wrote:
> I don't think that aspect of the documentation vs interface was ever
> implemented - the interface object doesn't specify a type method or
> init argument even though the documentation says so. Not really sure
> why not, this was ages ago unfortunately.
>
> This particular factory definitely assumes you are building Bio::Seq
> objects - you can try and subclass and build your own to see if it
> makes much of a difference in speed/memory - I would posit it won't
> make a significant difference but be interested to see what you find.
>
> Just make your own Bio::Seq::PrimarySeqFactory object based on
> Bio::Seq::FastaSpeedFactory and simplify that code so it doesn't build
> Bio::Seq object wrapper around the Bio::PrimarySeq and do some perf
> tests so we'll know if it makes any difference here.
>
> -jason
> Florent Angly wrote:
>> Hi all,
>>
>> I am trying to reduce memory usage and speedup reading FASTA files
>> using the facilities provided by BioPerl.
>>
>> The first thing I noticed is that when using Bio::SeqIO::fasta, the
>> objects returned are Bio::Seq, not Bio::PrimarySeq objects.
>> Bio::PrimarySeq sequences are lighter than Bio::Seq sequences, so
>> it's what I need. See code below:
>>> use warnings;
>>> use strict;
>>> use Data::Dumper;
>>> use Bio::SeqIO;
>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>> my $seqfactory = $in->sequence_factory; # Bio::Factory::ObjectBuilderI
>>> print "The factory is a ".ref($seqfactory)."\n";
>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>> my $seq = $in->next_seq;
>>> print Dumper($seq);
>>> __END__
>>> >seq1 a small test sequence q
>>> ACGTACGACTACGACTAGCGCCATCAGC
>> It returns:
>>> $VAR1 = bless( {
>>> 'primary_id' => 'seq1',
>>> 'primary_seq' => bless( {
>>> 'display_id' => 'seq1',
>>> 'primary_id' => 'seq1',
>>> 'desc' => 'a small test
>>> sequence',
>>> 'seq' =>
>>> 'ACGTACGACTACGACTAGCGCCATCAGC',
>>> 'alphabet' => 'dna'
>>> }, 'Bio::PrimarySeq' )
>>> }, 'Bio::Seq' );
>>
>> Actually, we have a Bio::Seq containing a Bio::PrimarySeq. I really
>> only need the Bio::PrimarySeq. Looking at the documentation for
>> Bio::SeqIO I found that I could in theory adjust the sequence factory
>> and sequence builder to my liking... I tried this:
>>> use warnings;
>>> use strict;
>>> use Data::Dumper;
>>> use Bio::SeqIO;
>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>> my $seqfactory = $in->sequence_factory; # Bio::Factory::ObjectBuilderI
>>> print "The factory is a ".ref($seqfactory)."\n";
>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>> my $seq = $in->next_seq;
>>> print Dumper($seq);
>>> __END__
>>> >seq1 a small test sequence
>>> ACGTACGACTACGACTAGCGCCATCAGC
>> This returns:
>>> The factory is a Bio::Seq::SeqFastaSpeedFactory
>>> Can't locate object method "type" via package
>>> "Bio::Seq::SeqFastaSpeedFactory" at ./seqbuilder_test_3.pl line 12,
>>> <DATA> line 1.
>>
>> According to Bio::Seq::FastaSpeedFactory's documentation:
>>> If you want the factory to create Bio::Seq objects instead
>>> of the default Bio::PrimarySeq objects, use the -type parameter
>> So, PrimarySeq should be the default type, even though in my case it
>> seems not to be. Second, I can't seem to use the -type method to
>> change what the return type is... It errors.
>>
>> Any ideas??? Thanks,
>>
>> Florent
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list