[Bioperl-l] Parsing FASTA files into PrimarySeq objects

Florent Angly florent.angly at gmail.com
Tue Feb 16 06:18:16 UTC 2010


Thanks for your reply, Jason. It's very valuable since you wrote the
modules in question!

Now that I understand the purpose of these modules, I have clarified
their documentation a little. I also realized that the 'type' method
was missing from Bio::Seq::SeqFastaSpeedFactory, so I implemented it
and made the factory able to create Bio::PrimarySeq objects. This is
all committed to SVN.

You are right that for large sequences the choice of sequence type
makes little difference in time or memory usage. For short sequences,
however, it matters. I tested with a 39.3 MB FASTA file containing
360,000 sequences averaging 106 bp.

CASE 1:
   Requested factory Bio::Seq::SeqFactory
   Requested sequence type is Bio::Seq
   Memory used: 329 MB
   Time spent: 1m8.522s

CASE 2:
   Requested factory Bio::Seq::SeqFactory
   Requested sequence type is Bio::PrimarySeq
   Memory used: 246 MB
   Time spent: 0m52.663s

CASE 3:
   Requested factory Bio::Seq::SeqFastaSpeedFactory
   Requested sequence type is Bio::Seq
   Memory used: 298 MB
   Time spent: 0m31.292s

CASE 4:
   Requested factory Bio::Seq::SeqFastaSpeedFactory
   Requested sequence type is Bio::PrimarySeq
   Memory used: 231 MB
   Time spent: 0m28.500s
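
To put the four cases side by side, here is a small sketch that uses
only core Perl and the figures listed above. The helper name
percent_saved is my own, and taking case 1 as the baseline is an
arbitrary choice made for illustration.

```perl
use strict;
use warnings;

# Figures copied from the four cases above:
# [label, memory in MB, wall-clock time in seconds] (1m8.522s = 68.522 s).
my @cases = (
    [ 'SeqFactory + Bio::Seq',                  329, 68.522 ],
    [ 'SeqFactory + Bio::PrimarySeq',           246, 52.663 ],
    [ 'SeqFastaSpeedFactory + Bio::Seq',        298, 31.292 ],
    [ 'SeqFastaSpeedFactory + Bio::PrimarySeq', 231, 28.500 ],
);

# Percentage by which $value is smaller than $baseline (illustrative helper).
sub percent_saved {
    my ($baseline, $value) = @_;
    return 100 * ($baseline - $value) / $baseline;
}

# Case 1 serves as the baseline for every comparison.
my (undef, $base_mem, $base_time) = @{ $cases[0] };

for my $case (@cases) {
    my ($label, $mem, $time) = @$case;
    printf "%-40s %5.1f%% less memory, %5.1f%% less time\n",
        $label,
        percent_saved($base_mem,  $mem),
        percent_saved($base_time, $time);
}
```

Relative to case 1, case 4 comes out at roughly 30% less memory and
58% less time.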

Clearly, when reading FASTA files and needing only simple sequence
attributes, using SeqFastaSpeedFactory with PrimarySeq is the fastest
and most memory-efficient of the four combinations.
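
For reference, a minimal sketch of how a requested factory and sequence
type can be wired into a SeqIO stream, corresponding to case 2. It
assumes BioPerl is installed. The helper name open_primaryseq_stream
and the in-memory FASTA record are made up for illustration.

```perl
use strict;
use warnings;
use Bio::SeqIO;
use Bio::Seq::SeqFactory;

# Illustrative helper (not part of BioPerl): open a FASTA stream on a
# filehandle and make it build bare Bio::PrimarySeq objects.
sub open_primaryseq_stream {
    my ($fh) = @_;
    my $in = Bio::SeqIO->new(-fh => $fh, -format => 'fasta');
    # Replace the stream's default factory with a generic one that is
    # told which sequence class to instantiate.
    $in->sequence_factory(
        Bio::Seq::SeqFactory->new(-type => 'Bio::PrimarySeq')
    );
    return $in;
}

# Hypothetical input: a tiny in-memory record standing in for a real file.
my $fasta = ">seq1 a small test sequence\nACGTACGACTACGACTAGCGCCATCAGC\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory FASTA: $!";

my $in = open_primaryseq_stream($fh);
while (my $seq = $in->next_seq) {
    # Process one record at a time so only one object is alive at once.
    printf "%s\t%s\t%d bp\n", ref($seq), $seq->display_id, $seq->length;
}
```

Cases 3 and 4 would swap in Bio::Seq::SeqFastaSpeedFactory instead.
Whether that factory accepts a type depends on having a BioPerl
version that includes the change described in this message, so treat
that part as an assumption.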

Florent


On 16/02/10 10:03, Jason Stajich wrote:
> I don't think that aspect of the documentation vs interface was ever 
> implemented - the interface object doesn't specify a type method or 
> init argument even though the documentation says so. Not really sure 
> why not, this was ages ago unfortunately.
>
> This particular factory definitely assumes you are building Bio::Seq
> objects. You can try to subclass it and build your own to see whether
> it makes much of a difference in speed/memory. I would posit it won't
> make a significant difference, but I'd be interested to see what you
> find.
>
> Just make your own Bio::Seq::PrimarySeqFactory object based on
> Bio::Seq::SeqFastaSpeedFactory and simplify that code so it doesn't
> build a Bio::Seq object wrapper around the Bio::PrimarySeq, then do
> some perf tests so we'll know if it makes any difference here.
>
> -jason
> Florent Angly wrote:
>> Hi all,
>>
>> I am trying to reduce memory usage and speed up reading FASTA files
>> using the facilities provided by BioPerl.
>>
>> The first thing I noticed is that when using Bio::SeqIO::fasta, the 
>> objects returned are Bio::Seq, not Bio::PrimarySeq objects. 
>> Bio::PrimarySeq sequences are lighter than Bio::Seq sequences, so 
>> it's what I need. See code below:
>>> use warnings;
>>> use strict;
>>> use Data::Dumper;
>>> use Bio::SeqIO;
>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>> my $seq = $in->next_seq;
>>> print Dumper($seq);
>>> __END__
>>> >seq1 a small test sequence
>>> ACGTACGACTACGACTAGCGCCATCAGC 
>> It returns:
>>> $VAR1 = bless( {
>>>                  'primary_id' => 'seq1',
>>>                  'primary_seq' => bless( {
>>>                                            'display_id' => 'seq1',
>>>                                            'primary_id' => 'seq1',
>>>                                            'desc' => 'a small test sequence',
>>>                                            'seq' => 'ACGTACGACTACGACTAGCGCCATCAGC',
>>>                                            'alphabet' => 'dna'
>>>                                          }, 'Bio::PrimarySeq' )
>>>                }, 'Bio::Seq' ); 
>>
>> Actually, we have a Bio::Seq containing a Bio::PrimarySeq. I really 
>> only need the Bio::PrimarySeq. Looking at the documentation for 
>> Bio::SeqIO I found that I could in theory adjust the sequence factory 
>> and sequence builder to my liking... I tried this:
>>> use warnings;
>>> use strict;
>>> use Data::Dumper;
>>> use Bio::SeqIO;
>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>> my $seqfactory = $in->sequence_factory; # a Bio::Factory::SequenceFactoryI
>>> print "The factory is a ".ref($seqfactory)."\n";
>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>> my $seq = $in->next_seq;
>>> print Dumper($seq);
>>> __END__
>>> >seq1 a small test sequence
>>> ACGTACGACTACGACTAGCGCCATCAGC 
>> This returns:
>>> The factory is a Bio::Seq::SeqFastaSpeedFactory
>>> Can't locate object method "type" via package 
>>> "Bio::Seq::SeqFastaSpeedFactory" at ./seqbuilder_test_3.pl line 12, 
>>> <DATA> line 1. 
>>
>> According to Bio::Seq::SeqFastaSpeedFactory's documentation:
>>> If you want the factory to create Bio::Seq objects instead
>>> of the default Bio::PrimarySeq objects, use the -type parameter 
>> So PrimarySeq should be the default type, even though in my case it
>> is not. Moreover, I cannot use the 'type' method to change the
>> return type: the call fails with the error above.
>>
>> Any ideas??? Thanks,
>>
>> Florent
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l 
