[Bioperl-l] Parsing FASTA files into PrimarySeq objects
Chris Fields
cjfields at illinois.edu
Wed Feb 17 20:33:05 EST 2010
I can work on pushing out another 1.6 point release (1.6.2), then we should work on breaking things up.
chris
On Feb 17, 2010, at 5:22 PM, Jason Stajich wrote:
> I rolled back.
> The fix seems to just be to make the default Bio::Seq not Bio::PrimarySeq in terms of what the factory return and undo some of the explicit expectations in the code. I think this speedup is a special case and Bio::Seq still needs to be the default but you override it when you want Bio::PrimarySeq objects only.
>
> Florent for your purposes you need to just set the type in your scripts.
>
> And in the future - you must run tests before committing something, especially one that touches so many files!
>
> I think I've unrolled most of the defaults while keeping the new type method intact.
>
> There are still warnings and few failed tests to followup.
>
> Okay - I am really now in favor of breaking the modules into packages test running taking forever and this is way unwieldy. I am looking forward to seeing things in smaller pieces somehow.
>
> -jason
> Robert Buels wrote:
>> OK, well are you guys going to roll them back and do a branch? Broken trunk is not good.
>>
>> R
>>
>> Jason Stajich wrote:
>>> hmm yes. Let's roll back those Florent.
>>>
>>> I guess we need to also discuss the concept of branches again here for major changes like this. The speed fix is helpful for short read but probably not for all systems so just having the ability to provide the right type when you use your scripts focused on short-reads might be the best way to go forward and preserve backwards compatibility.
>>>
>>> -jason
>>>
>>> Chris Fields wrote:
>>>> Yes, except that tests are now failing on live. Bio::PrimarySeq isn't a Bio::SeqI, so it's a bit of an API change in a lot of places expecting a Bio::SeqI. We should make this optional for the time being, until we discuss it more on list.
>>>>
>>>> chris
>>>>
>>>> On Feb 16, 2010, at 10:57 AM, Jason Stajich wrote:
>>>>
>>>>> Great! thanks for committing it.
>>>>>
>>>>> Florent Angly wrote:
>>>>>> Thanks for your reply Jason. It's very valuable since you coded the modules in question!
>>>>>>
>>>>>> Now that I understand the purpose of these modules, I did a little bit of documentation clarification. I also figured out that I should implement the 'type' method in Bio::Seq::SeqFastaSpeedFactory and made it support creating Bio::PrimarySeq. This is all committed to SVN.
>>>>>>
>>>>>> When sequences are large, you are right that it does not make much difference in time or memory resource to use one sequence type over the other. However, that's different for short sequences. I took a FASTA file weighting 39.3 MB and containing 360,000 sequences of 106 bp on average.
>>>>>>
>>>>>> CASE 1:
>>>>>> Requested factory Bio::Seq::SeqFactory
>>>>>> Requested sequence type is Bio::Seq
>>>>>> Memory used: 329 MB
>>>>>> Time spent: 1m8.522s
>>>>>>
>>>>>> CASE 2:
>>>>>> Requested factory Bio::Seq::SeqFactory
>>>>>> Requested sequence type is Bio::PrimarySeq
>>>>>> Memory used: 246 MB
>>>>>> Time spent: 0m52.663s
>>>>>>
>>>>>> CASE 3:
>>>>>> Requested factory Bio::Seq::SeqFastaSpeedFactory
>>>>>> Requested sequence type is Bio::Seq
>>>>>> Memory used: 298 MB
>>>>>> Time spent: 0m31.292s
>>>>>>
>>>>>> CASE 4:
>>>>>> Requested factory Bio::Seq::SeqFastaSpeedFactory
>>>>>> Requested sequence type is Bio::PrimarySeq
>>>>>> Memory used: 231 MB
>>>>>> Time spent: 0m28.500s
>>>>>>
>>>>>> Clearly, when only simple sequence attributes are needed or FASTA files are used, it's fast and memory efficient to use the SeqFastaSpeedFactory with PrimarySeq.
>>>>>>
>>>>>> Florent
>>>>>>
>>>>>>
>>>>>> On 16/02/10 10:03, Jason Stajich wrote:
>>>>>>> I don't think that aspect of the documentation vs interface was ever implemented - the interface object doesn't specify a type method or init argument even though the documentation says so. Not really sure why not, this was ages ago unfortunately.
>>>>>>>
>>>>>>> This particular factory definitely assumes you are building Bio::Seq objects - you can try and subclass and build your own to see if it makes much of a difference in speed/memory - I would posit it won't make a significant difference but be interested to see what you find.
>>>>>>>
>>>>>>> Just make your own Bio::Seq::PrimarySeqFactory object based on Bio::Seq::FastaSpeedFactory and simplify that code so it doesn't build Bio::Seq object wrapper around the Bio::PrimarySeq and do some perf tests so we'll know if it makes any difference here.
>>>>>>>
>>>>>>> -jason
>>>>>>> Florent Angly wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I am trying to reduce memory usage and speedup reading FASTA files using the facilities provided by BioPerl.
>>>>>>>>
>>>>>>>> The first thing I noticed is that when using Bio::SeqIO::fasta, the objects returned are Bio::Seq, not Bio::PrimarySeq objects. Bio::PrimarySeq sequences are lighter than Bio::Seq sequences, so it's what I need. See code below:
>>>>>>>>> use warnings;
>>>>>>>>> use strict;
>>>>>>>>> use Data::Dumper;
>>>>>>>>> use Bio::SeqIO;
>>>>>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>>>>>>> my $seqfactory = $in->sequence_factory; # Bio::Factory::ObjectBuilderI
>>>>>>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>>>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>>>>>>> my $seq = $in->next_seq;
>>>>>>>>> print Dumper($seq);
>>>>>>>>> __END__
>>>>>>>>>> seq1 a small test sequence q
>>>>>>>>> ACGTACGACTACGACTAGCGCCATCAGC
>>>>>>>> It returns:
>>>>>>>>> $VAR1 = bless( {
>>>>>>>>> 'primary_id' => 'seq1',
>>>>>>>>> 'primary_seq' => bless( {
>>>>>>>>> 'display_id' => 'seq1',
>>>>>>>>> 'primary_id' => 'seq1',
>>>>>>>>> 'desc' => 'a small test sequence',
>>>>>>>>> 'seq' => 'ACGTACGACTACGACTAGCGCCATCAGC',
>>>>>>>>> 'alphabet' => 'dna'
>>>>>>>>> }, 'Bio::PrimarySeq' )
>>>>>>>>> }, 'Bio::Seq' );
>>>>>>>> Actually, we have a Bio::Seq containing a Bio::PrimarySeq. I really only need the Bio::PrimarySeq. Looking at the documentation for Bio::SeqIO I found that I could in theory adjust the sequence factory and sequence builder to my liking... I tried this:
>>>>>>>>> use warnings;
>>>>>>>>> use strict;
>>>>>>>>> use Data::Dumper;
>>>>>>>>> use Bio::SeqIO;
>>>>>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>>>>>>> my $seqfactory = $in->sequence_factory; # Bio::Factory::ObjectBuilderI
>>>>>>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>>>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>>>>>>> my $seq = $in->next_seq;
>>>>>>>>> print Dumper($seq);
>>>>>>>>> __END__
>>>>>>>>>> seq1 a small test sequence
>>>>>>>>> ACGTACGACTACGACTAGCGCCATCAGC
>>>>>>>> This returns:
>>>>>>>>> The factory is a Bio::Seq::SeqFastaSpeedFactory
>>>>>>>>> Can't locate object method "type" via package "Bio::Seq::SeqFastaSpeedFactory" at ./seqbuilder_test_3.pl line 12,<DATA> line 1.
>>>>>>>> According to Bio::Seq::FastaSpeedFactory's documentation:
>>>>>>>>> If you want the factory to create Bio::Seq objects instead
>>>>>>>>> of the default Bio::PrimarySeq objects, use the -type parameter
>>>>>>>> So, PrimarySeq should be the default type, even though in my case it seems not to be. Second, I can't seem to use the -type method to change what the return type is... It errors.
>>>>>>>>
>>>>>>>> Any ideas??? Thanks,
>>>>>>>>
>>>>>>>> Florent
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioperl-l mailing list
>>>>>>>> Bioperl-l at lists.open-bio.org
>>>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list