[Bioperl-l] Parsing FASTA files into PrimarySeq objects

Robert Buels rmb32 at cornell.edu
Wed Feb 17 12:52:00 EST 2010


OK, well are you guys going to roll them back and do a branch?  Broken 
trunk is not good.

R

Jason Stajich wrote:
> hmm yes. Let's roll back those Florent.
> 
> I guess we need to also discuss the concept of branches again here for 
> major changes like this. The speed fix is helpful for short read but 
> probably not for all systems so just having the ability to provide the 
> right type when you use your scripts focused on short-reads might be the 
> best way to go forward and preserve backwards compatibility.
> 
> -jason
> 
> Chris Fields wrote:
>> Yes, except that tests are now failing on live.  Bio::PrimarySeq isn't 
>> a Bio::SeqI, so it's a bit of an API change in a lot of places 
>> expecting a Bio::SeqI.  We should make this optional for the time 
>> being, until we discuss it more on list.
>>
>> chris
>>
>> On Feb 16, 2010, at 10:57 AM, Jason Stajich wrote:
>>
>>   
>>> Great!  thanks for committing it.
>>>
>>> Florent Angly wrote:
>>>     
>>>> Thanks for your reply Jason. It's very valuable since you coded the 
>>>> modules in question!
>>>>
>>>> Now that I understand the purpose of these modules, I did a little 
>>>> bit of documentation clarification. I also figured out that I should 
>>>> implement the 'type' method in Bio::Seq::SeqFastaSpeedFactory and 
>>>> made it support creating Bio::PrimarySeq. This is all committed to SVN.
>>>>
>>>> When sequences are large, you are right that it does not make much 
>>>> difference in time or memory resource to use one sequence type over 
>>>> the other. However, that's different for short sequences. I took a 
>>>> FASTA file weighting 39.3 MB and containing 360,000 sequences of 106 
>>>> bp on average.
>>>>
>>>> CASE 1:
>>>>   Requested factory Bio::Seq::SeqFactory
>>>>   Requested sequence type is Bio::Seq
>>>>   Memory used: 329 MB
>>>>   Time spent: 1m8.522s
>>>>
>>>> CASE 2:
>>>>   Requested factory Bio::Seq::SeqFactory
>>>>   Requested sequence type is Bio::PrimarySeq
>>>>   Memory used: 246 MB
>>>>   Time spent: 0m52.663s
>>>>
>>>> CASE 3:
>>>>   Requested factory Bio::Seq::SeqFastaSpeedFactory
>>>>   Requested sequence type is Bio::Seq
>>>>   Memory used: 298 MB
>>>>   Time spent: 0m31.292s
>>>>
>>>> CASE 4:
>>>>   Requested factory Bio::Seq::SeqFastaSpeedFactory
>>>>   Requested sequence type is Bio::PrimarySeq
>>>>   Memory used: 231 MB
>>>>   Time spent: 0m28.500s
>>>>
>>>> Clearly, when only simple sequence attributes are needed or FASTA 
>>>> files are used, it's fast and memory efficient to use the 
>>>> SeqFastaSpeedFactory with PrimarySeq.
>>>>
>>>> Florent
>>>>
>>>>
>>>> On 16/02/10 10:03, Jason Stajich wrote:
>>>>       
>>>>> I don't think that aspect of the documentation vs interface was 
>>>>> ever implemented - the interface object doesn't specify a type 
>>>>> method or init argument even though the documentation says so. Not 
>>>>> really sure why not, this was ages ago unfortunately.
>>>>>
>>>>> This particular factory definitely assumes you are building 
>>>>> Bio::Seq objects - you can try and subclass and build your own to 
>>>>> see if it makes much of a difference in speed/memory - I would 
>>>>> posit it won't make a significant difference but be interested to 
>>>>> see what you find.
>>>>>
>>>>> Just make your own Bio::Seq::PrimarySeqFactory object based on 
>>>>> Bio::Seq::FastaSpeedFactory and simplify that code so it doesn't 
>>>>> build Bio::Seq object wrapper around the Bio::PrimarySeq and do 
>>>>> some perf tests so we'll know if it makes any difference here.
>>>>>
>>>>> -jason
>>>>> Florent Angly wrote:
>>>>>         
>>>>>> Hi all,
>>>>>>
>>>>>> I am trying to reduce memory usage and speedup reading FASTA files 
>>>>>> using the facilities provided by BioPerl.
>>>>>>
>>>>>> The first thing I noticed is that when using Bio::SeqIO::fasta, 
>>>>>> the objects returned are Bio::Seq, not Bio::PrimarySeq objects. 
>>>>>> Bio::PrimarySeq sequences are lighter than Bio::Seq sequences, so 
>>>>>> it's what I need. See code below:
>>>>>>           
>>>>>>> use warnings;
>>>>>>> use strict;
>>>>>>> use Data::Dumper;
>>>>>>> use Bio::SeqIO;
>>>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>>>>> my $seqfactory = $in->sequence_factory; # 
>>>>>>> Bio::Factory::ObjectBuilderI
>>>>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>>>>> my $seq = $in->next_seq;
>>>>>>> print Dumper($seq);
>>>>>>> __END__
>>>>>>>             
>>>>>>>> seq1 a small test sequence q
>>>>>>>>                
>>>>>>> ACGTACGACTACGACTAGCGCCATCAGC
>>>>>>>              
>>>>>> It returns:
>>>>>>           
>>>>>>> $VAR1 = bless( {
>>>>>>>                  'primary_id' =>  'seq1',
>>>>>>>                  'primary_seq' =>  bless( {
>>>>>>>                                            'display_id' =>  'seq1',
>>>>>>>                                            'primary_id' =>  'seq1',
>>>>>>>                                            'desc' =>  'a small 
>>>>>>> test sequence',
>>>>>>>                                            'seq' =>  
>>>>>>> 'ACGTACGACTACGACTAGCGCCATCAGC',
>>>>>>>                                            'alphabet' =>  'dna'
>>>>>>>                                          }, 'Bio::PrimarySeq' )
>>>>>>>                }, 'Bio::Seq' );
>>>>>>>              
>>>>>> Actually, we have a Bio::Seq containing a Bio::PrimarySeq. I 
>>>>>> really only need the Bio::PrimarySeq. Looking at the documentation 
>>>>>> for Bio::SeqIO I found that I could in theory adjust the sequence 
>>>>>> factory and sequence builder to my liking... I tried this:
>>>>>>           
>>>>>>> use warnings;
>>>>>>> use strict;
>>>>>>> use Data::Dumper;
>>>>>>> use Bio::SeqIO;
>>>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>>>>> my $seqfactory = $in->sequence_factory; # 
>>>>>>> Bio::Factory::ObjectBuilderI
>>>>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>>>>> my $seq = $in->next_seq;
>>>>>>> print Dumper($seq);
>>>>>>> __END__
>>>>>>>             
>>>>>>>> seq1 a small test sequence
>>>>>>>>                
>>>>>>> ACGTACGACTACGACTAGCGCCATCAGC
>>>>>>>              
>>>>>> This returns:
>>>>>>           
>>>>>>> The factory is a Bio::Seq::SeqFastaSpeedFactory
>>>>>>> Can't locate object method "type" via package 
>>>>>>> "Bio::Seq::SeqFastaSpeedFactory" at ./seqbuilder_test_3.pl line 
>>>>>>> 12,<DATA>  line 1.
>>>>>>>              
>>>>>> According to Bio::Seq::FastaSpeedFactory's documentation:
>>>>>>           
>>>>>>> If you want the factory to create Bio::Seq objects instead
>>>>>>> of the default Bio::PrimarySeq objects, use the -type parameter
>>>>>>>              
>>>>>> So, PrimarySeq should be the default type, even though in my case 
>>>>>> it seems not to be. Second, I can't seem to use the -type method 
>>>>>> to change what the return type is... It errors.
>>>>>>
>>>>>> Any ideas??? Thanks,
>>>>>>
>>>>>> Florent
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioperl-l mailing list
>>>>>> Bioperl-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>>            
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>      
>>
>>    
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 


-- 
Robert Buels
Bioinformatics Analyst, Sol Genomics Network
Boyce Thompson Institute for Plant Research
Tower Rd
Ithaca, NY  14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu


More information about the Bioperl-l mailing list