[Bioperl-l] Parsing FASTA files into PrimarySeq objects

Chris Fields cjfields at illinois.edu
Wed Feb 17 17:03:39 EST 2010


Rob, 

Looks like Jason took care of it.  There are likely more tests still
failing on trunk, but hopefully much fewer of them than before ;>

chris

On Wed, 2010-02-17 at 09:52 -0800, Robert Buels wrote:
> OK, well are you guys going to roll them back and do a branch?  Broken 
> trunk is not good.
> 
> R
> 
> Jason Stajich wrote:
> > hmm yes. Let's roll back those Florent.
> > 
> > I guess we need to also discuss the concept of branches again here for 
> > major changes like this. The speed fix is helpful for short read but 
> > probably not for all systems so just having the ability to provide the 
> > right type when you use your scripts focused on short-reads might be the 
> > best way to go forward and preserve backwards compatibility.
> > 
> > -jason
> > 
> > Chris Fields wrote:
> >> Yes, except that tests are now failing on live.  Bio::PrimarySeq isn't 
> >> a Bio::SeqI, so it's a bit of an API change in a lot of places 
> >> expecting a Bio::SeqI.  We should make this optional for the time 
> >> being, until we discuss it more on list.
> >>
> >> chris
> >>
> >> On Feb 16, 2010, at 10:57 AM, Jason Stajich wrote:
> >>
> >>   
> >>> Great!  thanks for committing it.
> >>>
> >>> Florent Angly wrote:
> >>>     
> >>>> Thanks for your reply Jason. It's very valuable since you coded the 
> >>>> modules in question!
> >>>>
> >>>> Now that I understand the purpose of these modules, I did a little 
> >>>> bit of documentation clarification. I also figured out that I should 
> >>>> implement the 'type' method in Bio::Seq::SeqFastaSpeedFactory and 
> >>>> made it support creating Bio::PrimarySeq. This is all committed to SVN.
> >>>>
> >>>> When sequences are large, you are right that it does not make much 
> >>>> difference in time or memory resource to use one sequence type over 
> >>>> the other. However, that's different for short sequences. I took a 
> >>>> FASTA file weighting 39.3 MB and containing 360,000 sequences of 106 
> >>>> bp on average.
> >>>>
> >>>> CASE 1:
> >>>>   Requested factory Bio::Seq::SeqFactory
> >>>>   Requested sequence type is Bio::Seq
> >>>>   Memory used: 329 MB
> >>>>   Time spent: 1m8.522s
> >>>>
> >>>> CASE 2:
> >>>>   Requested factory Bio::Seq::SeqFactory
> >>>>   Requested sequence type is Bio::PrimarySeq
> >>>>   Memory used: 246 MB
> >>>>   Time spent: 0m52.663s
> >>>>
> >>>> CASE 3:
> >>>>   Requested factory Bio::Seq::SeqFastaSpeedFactory
> >>>>   Requested sequence type is Bio::Seq
> >>>>   Memory used: 298 MB
> >>>>   Time spent: 0m31.292s
> >>>>
> >>>> CASE 4:
> >>>>   Requested factory Bio::Seq::SeqFastaSpeedFactory
> >>>>   Requested sequence type is Bio::PrimarySeq
> >>>>   Memory used: 231 MB
> >>>>   Time spent: 0m28.500s
> >>>>
> >>>> Clearly, when only simple sequence attributes are needed or FASTA 
> >>>> files are used, it's fast and memory efficient to use the 
> >>>> SeqFastaSpeedFactory with PrimarySeq.
> >>>>
> >>>> Florent
> >>>>
> >>>>
> >>>> On 16/02/10 10:03, Jason Stajich wrote:
> >>>>       
> >>>>> I don't think that aspect of the documentation vs interface was 
> >>>>> ever implemented - the interface object doesn't specify a type 
> >>>>> method or init argument even though the documentation says so. Not 
> >>>>> really sure why not, this was ages ago unfortunately.
> >>>>>
> >>>>> This particular factory definitely assumes you are building 
> >>>>> Bio::Seq objects - you can try and subclass and build your own to 
> >>>>> see if it makes much of a difference in speed/memory - I would 
> >>>>> posit it won't make a significant difference but be interested to 
> >>>>> see what you find.
> >>>>>
> >>>>> Just make your own Bio::Seq::PrimarySeqFactory object based on 
> >>>>> Bio::Seq::FastaSpeedFactory and simplify that code so it doesn't 
> >>>>> build Bio::Seq object wrapper around the Bio::PrimarySeq and do 
> >>>>> some perf tests so we'll know if it makes any difference here.
> >>>>>
> >>>>> -jason
> >>>>> Florent Angly wrote:
> >>>>>         
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I am trying to reduce memory usage and speedup reading FASTA files 
> >>>>>> using the facilities provided by BioPerl.
> >>>>>>
> >>>>>> The first thing I noticed is that when using Bio::SeqIO::fasta, 
> >>>>>> the objects returned are Bio::Seq, not Bio::PrimarySeq objects. 
> >>>>>> Bio::PrimarySeq sequences are lighter than Bio::Seq sequences, so 
> >>>>>> it's what I need. See code below:
> >>>>>>           
> >>>>>>> use warnings;
> >>>>>>> use strict;
> >>>>>>> use Data::Dumper;
> >>>>>>> use Bio::SeqIO;
> >>>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
> >>>>>>> my $seqfactory = $in->sequence_factory; # 
> >>>>>>> Bio::Factory::ObjectBuilderI
> >>>>>>> print "The factory is a ".ref($seqfactory)."\n";
> >>>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
> >>>>>>> my $seq = $in->next_seq;
> >>>>>>> print Dumper($seq);
> >>>>>>> __END__
> >>>>>>>             
> >>>>>>>> seq1 a small test sequence q
> >>>>>>>>                
> >>>>>>> ACGTACGACTACGACTAGCGCCATCAGC
> >>>>>>>              
> >>>>>> It returns:
> >>>>>>           
> >>>>>>> $VAR1 = bless( {
> >>>>>>>                  'primary_id' =>  'seq1',
> >>>>>>>                  'primary_seq' =>  bless( {
> >>>>>>>                                            'display_id' =>  'seq1',
> >>>>>>>                                            'primary_id' =>  'seq1',
> >>>>>>>                                            'desc' =>  'a small 
> >>>>>>> test sequence',
> >>>>>>>                                            'seq' =>  
> >>>>>>> 'ACGTACGACTACGACTAGCGCCATCAGC',
> >>>>>>>                                            'alphabet' =>  'dna'
> >>>>>>>                                          }, 'Bio::PrimarySeq' )
> >>>>>>>                }, 'Bio::Seq' );
> >>>>>>>              
> >>>>>> Actually, we have a Bio::Seq containing a Bio::PrimarySeq. I 
> >>>>>> really only need the Bio::PrimarySeq. Looking at the documentation 
> >>>>>> for Bio::SeqIO I found that I could in theory adjust the sequence 
> >>>>>> factory and sequence builder to my liking... I tried this:
> >>>>>>           
> >>>>>>> use warnings;
> >>>>>>> use strict;
> >>>>>>> use Data::Dumper;
> >>>>>>> use Bio::SeqIO;
> >>>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
> >>>>>>> my $seqfactory = $in->sequence_factory; # 
> >>>>>>> Bio::Factory::ObjectBuilderI
> >>>>>>> print "The factory is a ".ref($seqfactory)."\n";
> >>>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
> >>>>>>> my $seq = $in->next_seq;
> >>>>>>> print Dumper($seq);
> >>>>>>> __END__
> >>>>>>>             
> >>>>>>>> seq1 a small test sequence
> >>>>>>>>                
> >>>>>>> ACGTACGACTACGACTAGCGCCATCAGC
> >>>>>>>              
> >>>>>> This returns:
> >>>>>>           
> >>>>>>> The factory is a Bio::Seq::SeqFastaSpeedFactory
> >>>>>>> Can't locate object method "type" via package 
> >>>>>>> "Bio::Seq::SeqFastaSpeedFactory" at ./seqbuilder_test_3.pl line 
> >>>>>>> 12,<DATA>  line 1.
> >>>>>>>              
> >>>>>> According to Bio::Seq::FastaSpeedFactory's documentation:
> >>>>>>           
> >>>>>>> If you want the factory to create Bio::Seq objects instead
> >>>>>>> of the default Bio::PrimarySeq objects, use the -type parameter
> >>>>>>>              
> >>>>>> So, PrimarySeq should be the default type, even though in my case 
> >>>>>> it seems not to be. Second, I can't seem to use the -type method 
> >>>>>> to change what the return type is... It errors.
> >>>>>>
> >>>>>> Any ideas??? Thanks,
> >>>>>>
> >>>>>> Florent
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Bioperl-l mailing list
> >>>>>> Bioperl-l at lists.open-bio.org
> >>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>>>>>            
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>>      
> >>
> >>    
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> > 
> 
> 




More information about the Bioperl-l mailing list