[Bioperl-l] Parsing FASTA files into PrimarySeq objects

Wed Feb 17 23:22:56 UTC 2010

I rolled back.
The fix seems to just be to make the default Bio::Seq not 
Bio::PrimarySeq in terms of what the factory return and undo some of the 
explicit expectations in the code.  I think this speedup is a special 
case and Bio::Seq still needs to be the default but you override it when 
you want Bio::PrimarySeq objects only.

Florent for your purposes you need to just set the type in your scripts.

And in the future - you must run tests before committing something, 
especially one that touches so many files!

I think I've unrolled most of the defaults while keeping the new type 
method intact.

There are still warnings and few failed tests to followup.

Okay - I am really now in favor of breaking the modules into packages 
test running taking forever and this is way unwieldy. I am looking 
forward to seeing things in smaller pieces somehow.

-jason
Robert Buels wrote:
> OK, well are you guys going to roll them back and do a branch?  Broken 
> trunk is not good.
>
> R
>
> Jason Stajich wrote:
>> hmm yes. Let's roll back those Florent.
>>
>> I guess we need to also discuss the concept of branches again here 
>> for major changes like this. The speed fix is helpful for short read 
>> but probably not for all systems so just having the ability to 
>> provide the right type when you use your scripts focused on 
>> short-reads might be the best way to go forward and preserve 
>> backwards compatibility.
>>
>> -jason
>>
>> Chris Fields wrote:
>>> Yes, except that tests are now failing on live.  Bio::PrimarySeq 
>>> isn't a Bio::SeqI, so it's a bit of an API change in a lot of places 
>>> expecting a Bio::SeqI.  We should make this optional for the time 
>>> being, until we discuss it more on list.
>>>
>>> chris
>>>
>>> On Feb 16, 2010, at 10:57 AM, Jason Stajich wrote:
>>>
>>>> Great!  thanks for committing it.
>>>>
>>>> Florent Angly wrote:
>>>>> Thanks for your reply Jason. It's very valuable since you coded 
>>>>> the modules in question!
>>>>>
>>>>> Now that I understand the purpose of these modules, I did a little 
>>>>> bit of documentation clarification. I also figured out that I 
>>>>> should implement the 'type' method in 
>>>>> Bio::Seq::SeqFastaSpeedFactory and made it support creating 
>>>>> Bio::PrimarySeq. This is all committed to SVN.
>>>>>
>>>>> When sequences are large, you are right that it does not make much 
>>>>> difference in time or memory resource to use one sequence type 
>>>>> over the other. However, that's different for short sequences. I 
>>>>> took a FASTA file weighting 39.3 MB and containing 360,000 
>>>>> sequences of 106 bp on average.
>>>>>
>>>>> CASE 1:
>>>>>   Requested factory Bio::Seq::SeqFactory
>>>>>   Requested sequence type is Bio::Seq
>>>>>   Memory used: 329 MB
>>>>>   Time spent: 1m8.522s
>>>>>
>>>>> CASE 2:
>>>>>   Requested factory Bio::Seq::SeqFactory
>>>>>   Requested sequence type is Bio::PrimarySeq
>>>>>   Memory used: 246 MB
>>>>>   Time spent: 0m52.663s
>>>>>
>>>>> CASE 3:
>>>>>   Requested factory Bio::Seq::SeqFastaSpeedFactory
>>>>>   Requested sequence type is Bio::Seq
>>>>>   Memory used: 298 MB
>>>>>   Time spent: 0m31.292s
>>>>>
>>>>> CASE 4:
>>>>>   Requested factory Bio::Seq::SeqFastaSpeedFactory
>>>>>   Requested sequence type is Bio::PrimarySeq
>>>>>   Memory used: 231 MB
>>>>>   Time spent: 0m28.500s
>>>>>
>>>>> Clearly, when only simple sequence attributes are needed or FASTA 
>>>>> files are used, it's fast and memory efficient to use the 
>>>>> SeqFastaSpeedFactory with PrimarySeq.
>>>>>
>>>>> Florent
>>>>>
>>>>>
>>>>> On 16/02/10 10:03, Jason Stajich wrote:
>>>>>> I don't think that aspect of the documentation vs interface was 
>>>>>> ever implemented - the interface object doesn't specify a type 
>>>>>> method or init argument even though the documentation says so. 
>>>>>> Not really sure why not, this was ages ago unfortunately.
>>>>>>
>>>>>> This particular factory definitely assumes you are building 
>>>>>> Bio::Seq objects - you can try and subclass and build your own to 
>>>>>> see if it makes much of a difference in speed/memory - I would 
>>>>>> posit it won't make a significant difference but be interested to 
>>>>>> see what you find.
>>>>>>
>>>>>> Just make your own Bio::Seq::PrimarySeqFactory object based on 
>>>>>> Bio::Seq::FastaSpeedFactory and simplify that code so it doesn't 
>>>>>> build Bio::Seq object wrapper around the Bio::PrimarySeq and do 
>>>>>> some perf tests so we'll know if it makes any difference here.
>>>>>>
>>>>>> -jason
>>>>>> Florent Angly wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I am trying to reduce memory usage and speedup reading FASTA 
>>>>>>> files using the facilities provided by BioPerl.
>>>>>>>
>>>>>>> The first thing I noticed is that when using Bio::SeqIO::fasta, 
>>>>>>> the objects returned are Bio::Seq, not Bio::PrimarySeq objects. 
>>>>>>> Bio::PrimarySeq sequences are lighter than Bio::Seq sequences, 
>>>>>>> so it's what I need. See code below:
>>>>>>>> use warnings;
>>>>>>>> use strict;
>>>>>>>> use Data::Dumper;
>>>>>>>> use Bio::SeqIO;
>>>>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>>>>>> my $seqfactory = $in->sequence_factory; # 
>>>>>>>> Bio::Factory::ObjectBuilderI
>>>>>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>>>>>> my $seq = $in->next_seq;
>>>>>>>> print Dumper($seq);
>>>>>>>> __END__
>>>>>>>>> seq1 a small test sequence q 
>>>>>>>> ACGTACGACTACGACTAGCGCCATCAGC 
>>>>>>> It returns:
>>>>>>>> $VAR1 = bless( {
>>>>>>>>                  'primary_id' =>  'seq1',
>>>>>>>>                  'primary_seq' =>  bless( {
>>>>>>>>                                            'display_id' =>  
>>>>>>>> 'seq1',
>>>>>>>>                                            'primary_id' =>  
>>>>>>>> 'seq1',
>>>>>>>>                                            'desc' =>  'a small 
>>>>>>>> test sequence',
>>>>>>>>                                            'seq' =>  
>>>>>>>> 'ACGTACGACTACGACTAGCGCCATCAGC',
>>>>>>>>                                            'alphabet' =>  'dna'
>>>>>>>>                                          }, 'Bio::PrimarySeq' )
>>>>>>>>                }, 'Bio::Seq' ); 
>>>>>>> Actually, we have a Bio::Seq containing a Bio::PrimarySeq. I 
>>>>>>> really only need the Bio::PrimarySeq. Looking at the 
>>>>>>> documentation for Bio::SeqIO I found that I could in theory 
>>>>>>> adjust the sequence factory and sequence builder to my liking... 
>>>>>>> I tried this:
>>>>>>>> use warnings;
>>>>>>>> use strict;
>>>>>>>> use Data::Dumper;
>>>>>>>> use Bio::SeqIO;
>>>>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>>>>>> my $seqfactory = $in->sequence_factory; # 
>>>>>>>> Bio::Factory::ObjectBuilderI
>>>>>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>>>>>> my $seq = $in->next_seq;
>>>>>>>> print Dumper($seq);
>>>>>>>> __END__
>>>>>>>>> seq1 a small test sequence 
>>>>>>>> ACGTACGACTACGACTAGCGCCATCAGC 
>>>>>>> This returns:
>>>>>>>> The factory is a Bio::Seq::SeqFastaSpeedFactory
>>>>>>>> Can't locate object method "type" via package 
>>>>>>>> "Bio::Seq::SeqFastaSpeedFactory" at ./seqbuilder_test_3.pl line 
>>>>>>>> 12,<DATA>  line 1. 
>>>>>>> According to Bio::Seq::FastaSpeedFactory's documentation:
>>>>>>>> If you want the factory to create Bio::Seq objects instead
>>>>>>>> of the default Bio::PrimarySeq objects, use the -type parameter 
>>>>>>> So, PrimarySeq should be the default type, even though in my 
>>>>>>> case it seems not to be. Second, I can't seem to use the -type 
>>>>>>> method to change what the return type is... It errors.
>>>>>>>
>>>>>>> Any ideas??? Thanks,
>>>>>>>
>>>>>>> Florent
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioperl-l mailing list
>>>>>>> Bioperl-l at lists.open-bio.org
>>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l