[Bioperl-l] Parsing FASTA files into PrimarySeq objects

Hilmar Lapp hlapp at drycafe.net
Wed Feb 17 22:48:00 EST 2010


Thanks Jason - your help is hugely appreciated.

-hilmar

Sent from away

On Feb 17, 2010, at 6:22 PM, Jason Stajich <jason at bioperl.org> wrote:

> I rolled back.
> The fix seems to just be to make the default Bio::Seq not  
> Bio::PrimarySeq in terms of what the factory return and undo some of  
> the explicit expectations in the code.  I think this speedup is a  
> special case and Bio::Seq still needs to be the default but you  
> override it when you want Bio::PrimarySeq objects only.
>
> Florent for your purposes you need to just set the type in your  
> scripts.
>
> And in the future - you must run tests before committing something,  
> especially one that touches so many files!
>
> I think I've unrolled most of the defaults while keeping the new  
> type method intact.
>
> There are still warnings and few failed tests to followup.
>
> Okay - I am really now in favor of breaking the modules into  
> packages test running taking forever and this is way unwieldy. I am  
> looking forward to seeing things in smaller pieces somehow.
>
> -jason
> Robert Buels wrote:
>> OK, well are you guys going to roll them back and do a branch?   
>> Broken trunk is not good.
>>
>> R
>>
>> Jason Stajich wrote:
>>> hmm yes. Let's roll back those Florent.
>>>
>>> I guess we need to also discuss the concept of branches again here  
>>> for major changes like this. The speed fix is helpful for short  
>>> read but probably not for all systems so just having the ability  
>>> to provide the right type when you use your scripts focused on  
>>> short-reads might be the best way to go forward and preserve  
>>> backwards compatibility.
>>>
>>> -jason
>>>
>>> Chris Fields wrote:
>>>> Yes, except that tests are now failing on live.  Bio::PrimarySeq  
>>>> isn't a Bio::SeqI, so it's a bit of an API change in a lot of  
>>>> places expecting a Bio::SeqI.  We should make this optional for  
>>>> the time being, until we discuss it more on list.
>>>>
>>>> chris
>>>>
>>>> On Feb 16, 2010, at 10:57 AM, Jason Stajich wrote:
>>>>
>>>>> Great!  thanks for committing it.
>>>>>
>>>>> Florent Angly wrote:
>>>>>> Thanks for your reply Jason. It's very valuable since you coded  
>>>>>> the modules in question!
>>>>>>
>>>>>> Now that I understand the purpose of these modules, I did a  
>>>>>> little bit of documentation clarification. I also figured out  
>>>>>> that I should implement the 'type' method in  
>>>>>> Bio::Seq::SeqFastaSpeedFactory and made it support creating  
>>>>>> Bio::PrimarySeq. This is all committed to SVN.
>>>>>>
>>>>>> When sequences are large, you are right that it does not make  
>>>>>> much difference in time or memory resource to use one sequence  
>>>>>> type over the other. However, that's different for short  
>>>>>> sequences. I took a FASTA file weighting 39.3 MB and containing  
>>>>>> 360,000 sequences of 106 bp on average.
>>>>>>
>>>>>> CASE 1:
>>>>>>  Requested factory Bio::Seq::SeqFactory
>>>>>>  Requested sequence type is Bio::Seq
>>>>>>  Memory used: 329 MB
>>>>>>  Time spent: 1m8.522s
>>>>>>
>>>>>> CASE 2:
>>>>>>  Requested factory Bio::Seq::SeqFactory
>>>>>>  Requested sequence type is Bio::PrimarySeq
>>>>>>  Memory used: 246 MB
>>>>>>  Time spent: 0m52.663s
>>>>>>
>>>>>> CASE 3:
>>>>>>  Requested factory Bio::Seq::SeqFastaSpeedFactory
>>>>>>  Requested sequence type is Bio::Seq
>>>>>>  Memory used: 298 MB
>>>>>>  Time spent: 0m31.292s
>>>>>>
>>>>>> CASE 4:
>>>>>>  Requested factory Bio::Seq::SeqFastaSpeedFactory
>>>>>>  Requested sequence type is Bio::PrimarySeq
>>>>>>  Memory used: 231 MB
>>>>>>  Time spent: 0m28.500s
>>>>>>
>>>>>> Clearly, when only simple sequence attributes are needed or  
>>>>>> FASTA files are used, it's fast and memory efficient to use the  
>>>>>> SeqFastaSpeedFactory with PrimarySeq.
>>>>>>
>>>>>> Florent
>>>>>>
>>>>>>
>>>>>> On 16/02/10 10:03, Jason Stajich wrote:
>>>>>>> I don't think that aspect of the documentation vs interface  
>>>>>>> was ever implemented - the interface object doesn't specify a  
>>>>>>> type method or init argument even though the documentation  
>>>>>>> says so. Not really sure why not, this was ages ago  
>>>>>>> unfortunately.
>>>>>>>
>>>>>>> This particular factory definitely assumes you are building  
>>>>>>> Bio::Seq objects - you can try and subclass and build your own  
>>>>>>> to see if it makes much of a difference in speed/memory - I  
>>>>>>> would posit it won't make a significant difference but be  
>>>>>>> interested to see what you find.
>>>>>>>
>>>>>>> Just make your own Bio::Seq::PrimarySeqFactory object based on  
>>>>>>> Bio::Seq::FastaSpeedFactory and simplify that code so it  
>>>>>>> doesn't build Bio::Seq object wrapper around the  
>>>>>>> Bio::PrimarySeq and do some perf tests so we'll know if it  
>>>>>>> makes any difference here.
>>>>>>>
>>>>>>> -jason
>>>>>>> Florent Angly wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I am trying to reduce memory usage and speedup reading FASTA  
>>>>>>>> files using the facilities provided by BioPerl.
>>>>>>>>
>>>>>>>> The first thing I noticed is that when using  
>>>>>>>> Bio::SeqIO::fasta, the objects returned are Bio::Seq, not  
>>>>>>>> Bio::PrimarySeq objects. Bio::PrimarySeq sequences are  
>>>>>>>> lighter than Bio::Seq sequences, so it's what I need. See  
>>>>>>>> code below:
>>>>>>>>> use warnings;
>>>>>>>>> use strict;
>>>>>>>>> use Data::Dumper;
>>>>>>>>> use Bio::SeqIO;
>>>>>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>>>>>>> my $seqfactory = $in->sequence_factory; #  
>>>>>>>>> Bio::Factory::ObjectBuilderI
>>>>>>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>>>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>>>>>>> my $seq = $in->next_seq;
>>>>>>>>> print Dumper($seq);
>>>>>>>>> __END__
>>>>>>>>>> seq1 a small test sequence q
>>>>>>>>> ACGTACGACTACGACTAGCGCCATCAGC
>>>>>>>> It returns:
>>>>>>>>> $VAR1 = bless( {
>>>>>>>>>                 'primary_id' =>  'seq1',
>>>>>>>>>                 'primary_seq' =>  bless( {
>>>>>>>>>                                           'display_id' =>   
>>>>>>>>> 'seq1',
>>>>>>>>>                                           'primary_id' =>   
>>>>>>>>> 'seq1',
>>>>>>>>>                                           'desc' =>  'a  
>>>>>>>>> small test sequence',
>>>>>>>>>                                           'seq' =>   
>>>>>>>>> 'ACGTACGACTACGACTAGCGCCATCAGC',
>>>>>>>>>                                           'alphabet' =>  'dna'
>>>>>>>>>                                         }, 'Bio::PrimarySeq' )
>>>>>>>>>               }, 'Bio::Seq' );
>>>>>>>> Actually, we have a Bio::Seq containing a Bio::PrimarySeq. I  
>>>>>>>> really only need the Bio::PrimarySeq. Looking at the  
>>>>>>>> documentation for Bio::SeqIO I found that I could in theory  
>>>>>>>> adjust the sequence factory and sequence builder to my  
>>>>>>>> liking... I tried this:
>>>>>>>>> use warnings;
>>>>>>>>> use strict;
>>>>>>>>> use Data::Dumper;
>>>>>>>>> use Bio::SeqIO;
>>>>>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>>>>>>> my $seqfactory = $in->sequence_factory; #  
>>>>>>>>> Bio::Factory::ObjectBuilderI
>>>>>>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>>>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>>>>>>> my $seq = $in->next_seq;
>>>>>>>>> print Dumper($seq);
>>>>>>>>> __END__
>>>>>>>>>> seq1 a small test sequence
>>>>>>>>> ACGTACGACTACGACTAGCGCCATCAGC
>>>>>>>> This returns:
>>>>>>>>> The factory is a Bio::Seq::SeqFastaSpeedFactory
>>>>>>>>> Can't locate object method "type" via package  
>>>>>>>>> "Bio::Seq::SeqFastaSpeedFactory" at ./seqbuilder_test_3.pl  
>>>>>>>>> line 12,<DATA>  line 1.
>>>>>>>> According to Bio::Seq::FastaSpeedFactory's documentation:
>>>>>>>>> If you want the factory to create Bio::Seq objects instead
>>>>>>>>> of the default Bio::PrimarySeq objects, use the -type  
>>>>>>>>> parameter
>>>>>>>> So, PrimarySeq should be the default type, even though in my  
>>>>>>>> case it seems not to be. Second, I can't seem to use the - 
>>>>>>>> type method to change what the return type is... It errors.
>>>>>>>>
>>>>>>>> Any ideas??? Thanks,
>>>>>>>>
>>>>>>>> Florent
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioperl-l mailing list
>>>>>>>> Bioperl-l at lists.open-bio.org
>>>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


More information about the Bioperl-l mailing list