[Bioperl-l] Parsing FASTA files into PrimarySeq objects

Florent Angly florent.angly at gmail.com
Thu Feb 18 00:23:47 EST 2010


Sorry guys, I was out of internet range until now.
My intentions were not to break backward compatibility in my own 
interest. I did run the tests that were directly relevant to my change. 
My bad for not running _all_ BioPerl tests. I obviously underestimated 
the scope of my changes and I should have run all tests...

Florent

PS/ Thank god there is SVN  -_-'


On 18/02/10 13:48, Hilmar Lapp wrote:
> Thanks Jason - your help is hugely appreciated.
>
> -hilmar
>
> Sent from away
>
> On Feb 17, 2010, at 6:22 PM, Jason Stajich <jason at bioperl.org> wrote:
>
>> I rolled back.
>> The fix seems to just be to make the default Bio::Seq not 
>> Bio::PrimarySeq in terms of what the factory return and undo some of 
>> the explicit expectations in the code.  I think this speedup is a 
>> special case and Bio::Seq still needs to be the default but you 
>> override it when you want Bio::PrimarySeq objects only.
>>
>> Florent for your purposes you need to just set the type in your scripts.
>>
>> And in the future - you must run tests before committing something, 
>> especially one that touches so many files!
>>
>> I think I've unrolled most of the defaults while keeping the new type 
>> method intact.
>>
>> There are still warnings and few failed tests to followup.
>>
>> Okay - I am really now in favor of breaking the modules into packages 
>> test running taking forever and this is way unwieldy. I am looking 
>> forward to seeing things in smaller pieces somehow.
>>
>> -jason
>> Robert Buels wrote:
>>> OK, well are you guys going to roll them back and do a branch?  
>>> Broken trunk is not good.
>>>
>>> R
>>>
>>> Jason Stajich wrote:
>>>> hmm yes. Let's roll back those Florent.
>>>>
>>>> I guess we need to also discuss the concept of branches again here 
>>>> for major changes like this. The speed fix is helpful for short 
>>>> read but probably not for all systems so just having the ability to 
>>>> provide the right type when you use your scripts focused on 
>>>> short-reads might be the best way to go forward and preserve 
>>>> backwards compatibility.
>>>>
>>>> -jason
>>>>
>>>> Chris Fields wrote:
>>>>> Yes, except that tests are now failing on live.  Bio::PrimarySeq 
>>>>> isn't a Bio::SeqI, so it's a bit of an API change in a lot of 
>>>>> places expecting a Bio::SeqI.  We should make this optional for 
>>>>> the time being, until we discuss it more on list.
>>>>>
>>>>> chris
>>>>>
>>>>> On Feb 16, 2010, at 10:57 AM, Jason Stajich wrote:
>>>>>
>>>>>> Great!  thanks for committing it.
>>>>>>
>>>>>> Florent Angly wrote:
>>>>>>> Thanks for your reply Jason. It's very valuable since you coded 
>>>>>>> the modules in question!
>>>>>>>
>>>>>>> Now that I understand the purpose of these modules, I did a 
>>>>>>> little bit of documentation clarification. I also figured out 
>>>>>>> that I should implement the 'type' method in 
>>>>>>> Bio::Seq::SeqFastaSpeedFactory and made it support creating 
>>>>>>> Bio::PrimarySeq. This is all committed to SVN.
>>>>>>>
>>>>>>> When sequences are large, you are right that it does not make 
>>>>>>> much difference in time or memory resource to use one sequence 
>>>>>>> type over the other. However, that's different for short 
>>>>>>> sequences. I took a FASTA file weighting 39.3 MB and containing 
>>>>>>> 360,000 sequences of 106 bp on average.
>>>>>>>
>>>>>>> CASE 1:
>>>>>>>  Requested factory Bio::Seq::SeqFactory
>>>>>>>  Requested sequence type is Bio::Seq
>>>>>>>  Memory used: 329 MB
>>>>>>>  Time spent: 1m8.522s
>>>>>>>
>>>>>>> CASE 2:
>>>>>>>  Requested factory Bio::Seq::SeqFactory
>>>>>>>  Requested sequence type is Bio::PrimarySeq
>>>>>>>  Memory used: 246 MB
>>>>>>>  Time spent: 0m52.663s
>>>>>>>
>>>>>>> CASE 3:
>>>>>>>  Requested factory Bio::Seq::SeqFastaSpeedFactory
>>>>>>>  Requested sequence type is Bio::Seq
>>>>>>>  Memory used: 298 MB
>>>>>>>  Time spent: 0m31.292s
>>>>>>>
>>>>>>> CASE 4:
>>>>>>>  Requested factory Bio::Seq::SeqFastaSpeedFactory
>>>>>>>  Requested sequence type is Bio::PrimarySeq
>>>>>>>  Memory used: 231 MB
>>>>>>>  Time spent: 0m28.500s
>>>>>>>
>>>>>>> Clearly, when only simple sequence attributes are needed or 
>>>>>>> FASTA files are used, it's fast and memory efficient to use the 
>>>>>>> SeqFastaSpeedFactory with PrimarySeq.
>>>>>>>
>>>>>>> Florent
>>>>>>>
>>>>>>>
>>>>>>> On 16/02/10 10:03, Jason Stajich wrote:
>>>>>>>> I don't think that aspect of the documentation vs interface was 
>>>>>>>> ever implemented - the interface object doesn't specify a type 
>>>>>>>> method or init argument even though the documentation says so. 
>>>>>>>> Not really sure why not, this was ages ago unfortunately.
>>>>>>>>
>>>>>>>> This particular factory definitely assumes you are building 
>>>>>>>> Bio::Seq objects - you can try and subclass and build your own 
>>>>>>>> to see if it makes much of a difference in speed/memory - I 
>>>>>>>> would posit it won't make a significant difference but be 
>>>>>>>> interested to see what you find.
>>>>>>>>
>>>>>>>> Just make your own Bio::Seq::PrimarySeqFactory object based on 
>>>>>>>> Bio::Seq::FastaSpeedFactory and simplify that code so it 
>>>>>>>> doesn't build Bio::Seq object wrapper around the 
>>>>>>>> Bio::PrimarySeq and do some perf tests so we'll know if it 
>>>>>>>> makes any difference here.
>>>>>>>>
>>>>>>>> -jason
>>>>>>>> Florent Angly wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I am trying to reduce memory usage and speedup reading FASTA 
>>>>>>>>> files using the facilities provided by BioPerl.
>>>>>>>>>
>>>>>>>>> The first thing I noticed is that when using 
>>>>>>>>> Bio::SeqIO::fasta, the objects returned are Bio::Seq, not 
>>>>>>>>> Bio::PrimarySeq objects. Bio::PrimarySeq sequences are lighter 
>>>>>>>>> than Bio::Seq sequences, so it's what I need. See code below:
>>>>>>>>>> use warnings;
>>>>>>>>>> use strict;
>>>>>>>>>> use Data::Dumper;
>>>>>>>>>> use Bio::SeqIO;
>>>>>>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>>>>>>>> my $seqfactory = $in->sequence_factory; # 
>>>>>>>>>> Bio::Factory::ObjectBuilderI
>>>>>>>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>>>>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>>>>>>>> my $seq = $in->next_seq;
>>>>>>>>>> print Dumper($seq);
>>>>>>>>>> __END__
>>>>>>>>>>> seq1 a small test sequence q
>>>>>>>>>> ACGTACGACTACGACTAGCGCCATCAGC
>>>>>>>>> It returns:
>>>>>>>>>> $VAR1 = bless( {
>>>>>>>>>>                 'primary_id' =>  'seq1',
>>>>>>>>>>                 'primary_seq' =>  bless( {
>>>>>>>>>>                                           'display_id' =>  
>>>>>>>>>> 'seq1',
>>>>>>>>>>                                           'primary_id' =>  
>>>>>>>>>> 'seq1',
>>>>>>>>>>                                           'desc' =>  'a small 
>>>>>>>>>> test sequence',
>>>>>>>>>>                                           'seq' =>  
>>>>>>>>>> 'ACGTACGACTACGACTAGCGCCATCAGC',
>>>>>>>>>>                                           'alphabet' =>  'dna'
>>>>>>>>>>                                         }, 'Bio::PrimarySeq' )
>>>>>>>>>>               }, 'Bio::Seq' );
>>>>>>>>> Actually, we have a Bio::Seq containing a Bio::PrimarySeq. I 
>>>>>>>>> really only need the Bio::PrimarySeq. Looking at the 
>>>>>>>>> documentation for Bio::SeqIO I found that I could in theory 
>>>>>>>>> adjust the sequence factory and sequence builder to my 
>>>>>>>>> liking... I tried this:
>>>>>>>>>> use warnings;
>>>>>>>>>> use strict;
>>>>>>>>>> use Data::Dumper;
>>>>>>>>>> use Bio::SeqIO;
>>>>>>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>>>>>>>> my $seqfactory = $in->sequence_factory; # 
>>>>>>>>>> Bio::Factory::ObjectBuilderI
>>>>>>>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>>>>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>>>>>>>> my $seq = $in->next_seq;
>>>>>>>>>> print Dumper($seq);
>>>>>>>>>> __END__
>>>>>>>>>>> seq1 a small test sequence
>>>>>>>>>> ACGTACGACTACGACTAGCGCCATCAGC
>>>>>>>>> This returns:
>>>>>>>>>> The factory is a Bio::Seq::SeqFastaSpeedFactory
>>>>>>>>>> Can't locate object method "type" via package 
>>>>>>>>>> "Bio::Seq::SeqFastaSpeedFactory" at ./seqbuilder_test_3.pl 
>>>>>>>>>> line 12,<DATA>  line 1.
>>>>>>>>> According to Bio::Seq::FastaSpeedFactory's documentation:
>>>>>>>>>> If you want the factory to create Bio::Seq objects instead
>>>>>>>>>> of the default Bio::PrimarySeq objects, use the -type parameter
>>>>>>>>> So, PrimarySeq should be the default type, even though in my 
>>>>>>>>> case it seems not to be. Second, I can't seem to use the -type 
>>>>>>>>> method to change what the return type is... It errors.
>>>>>>>>>
>>>>>>>>> Any ideas??? Thanks,
>>>>>>>>>
>>>>>>>>> Florent
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioperl-l mailing list
>>>>>>>>> Bioperl-l at lists.open-bio.org
>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>> _______________________________________________
>>>>>> Bioperl-l mailing list
>>>>>> Bioperl-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l



More information about the Bioperl-l mailing list