[Bioperl-l] Parsing FASTA files into PrimarySeq objects

Jason Stajich jason at bioperl.org
Tue Feb 16 16:57:14 UTC 2010


Great!  thanks for committing it.

Florent Angly wrote:
> Thanks for your reply Jason. It's very valuable since you coded the 
> modules in question!
>
> Now that I understand the purpose of these modules, I did a little bit 
> of documentation clarification. I also figured out that I should 
> implement the 'type' method in Bio::Seq::SeqFastaSpeedFactory and made 
> it support creating Bio::PrimarySeq. This is all committed to SVN.
>
> When sequences are large, you are right that it does not make much 
> difference in time or memory resource to use one sequence type over 
> the other. However, that's different for short sequences. I took a 
> FASTA file weighting 39.3 MB and containing 360,000 sequences of 106 
> bp on average.
>
> CASE 1:
>   Requested factory Bio::Seq::SeqFactory
>   Requested sequence type is Bio::Seq
>   Memory used: 329 MB
>   Time spent: 1m8.522s
>
> CASE 2:
>   Requested factory Bio::Seq::SeqFactory
>   Requested sequence type is Bio::PrimarySeq
>   Memory used: 246 MB
>   Time spent: 0m52.663s
>
> CASE 3:
>   Requested factory Bio::Seq::SeqFastaSpeedFactory
>   Requested sequence type is Bio::Seq
>   Memory used: 298 MB
>   Time spent: 0m31.292s
>
> CASE 4:
>   Requested factory Bio::Seq::SeqFastaSpeedFactory
>   Requested sequence type is Bio::PrimarySeq
>   Memory used: 231 MB
>   Time spent: 0m28.500s
>
> Clearly, when only simple sequence attributes are needed or FASTA 
> files are used, it's fast and memory efficient to use the 
> SeqFastaSpeedFactory with PrimarySeq.
>
> Florent
>
>
> On 16/02/10 10:03, Jason Stajich wrote:
>> I don't think that aspect of the documentation vs interface was ever 
>> implemented - the interface object doesn't specify a type method or 
>> init argument even though the documentation says so. Not really sure 
>> why not, this was ages ago unfortunately.
>>
>> This particular factory definitely assumes you are building Bio::Seq 
>> objects - you can try and subclass and build your own to see if it 
>> makes much of a difference in speed/memory - I would posit it won't 
>> make a significant difference but be interested to see what you find.
>>
>> Just make your own Bio::Seq::PrimarySeqFactory object based on 
>> Bio::Seq::FastaSpeedFactory and simplify that code so it doesn't 
>> build Bio::Seq object wrapper around the Bio::PrimarySeq and do some 
>> perf tests so we'll know if it makes any difference here.
>>
>> -jason
>> Florent Angly wrote:
>>> Hi all,
>>>
>>> I am trying to reduce memory usage and speedup reading FASTA files 
>>> using the facilities provided by BioPerl.
>>>
>>> The first thing I noticed is that when using Bio::SeqIO::fasta, the 
>>> objects returned are Bio::Seq, not Bio::PrimarySeq objects. 
>>> Bio::PrimarySeq sequences are lighter than Bio::Seq sequences, so 
>>> it's what I need. See code below:
>>>> use warnings;
>>>> use strict;
>>>> use Data::Dumper;
>>>> use Bio::SeqIO;
>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>> my $seqfactory = $in->sequence_factory; # Bio::Factory::ObjectBuilderI
>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>> my $seq = $in->next_seq;
>>>> print Dumper($seq);
>>>> __END__
>>>> >seq1 a small test sequence q
>>>> ACGTACGACTACGACTAGCGCCATCAGC 
>>> It returns:
>>>> $VAR1 = bless( {
>>>>                  'primary_id' => 'seq1',
>>>>                  'primary_seq' => bless( {
>>>>                                            'display_id' => 'seq1',
>>>>                                            'primary_id' => 'seq1',
>>>>                                            'desc' => 'a small test 
>>>> sequence',
>>>>                                            'seq' => 
>>>> 'ACGTACGACTACGACTAGCGCCATCAGC',
>>>>                                            'alphabet' => 'dna'
>>>>                                          }, 'Bio::PrimarySeq' )
>>>>                }, 'Bio::Seq' ); 
>>>
>>> Actually, we have a Bio::Seq containing a Bio::PrimarySeq. I really 
>>> only need the Bio::PrimarySeq. Looking at the documentation for 
>>> Bio::SeqIO I found that I could in theory adjust the sequence 
>>> factory and sequence builder to my liking... I tried this:
>>>> use warnings;
>>>> use strict;
>>>> use Data::Dumper;
>>>> use Bio::SeqIO;
>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>> my $seqfactory = $in->sequence_factory; # Bio::Factory::ObjectBuilderI
>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>> my $seq = $in->next_seq;
>>>> print Dumper($seq);
>>>> __END__
>>>> >seq1 a small test sequence
>>>> ACGTACGACTACGACTAGCGCCATCAGC 
>>> This returns:
>>>> The factory is a Bio::Seq::SeqFastaSpeedFactory
>>>> Can't locate object method "type" via package 
>>>> "Bio::Seq::SeqFastaSpeedFactory" at ./seqbuilder_test_3.pl line 12, 
>>>> <DATA> line 1. 
>>>
>>> According to Bio::Seq::FastaSpeedFactory's documentation:
>>>> If you want the factory to create Bio::Seq objects instead
>>>> of the default Bio::PrimarySeq objects, use the -type parameter 
>>> So, PrimarySeq should be the default type, even though in my case it 
>>> seems not to be. Second, I can't seem to use the -type method to 
>>> change what the return type is... It errors.
>>>
>>> Any ideas??? Thanks,
>>>
>>> Florent
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l 



More information about the Bioperl-l mailing list