[Bioperl-l] Parsing FASTA files into PrimarySeq objects
Jason Stajich
jason at bioperl.org
Wed Feb 17 18:22:56 EST 2010
I rolled back.
The fix seems to just be to make the default Bio::Seq not
Bio::PrimarySeq in terms of what the factory return and undo some of the
explicit expectations in the code. I think this speedup is a special
case and Bio::Seq still needs to be the default but you override it when
you want Bio::PrimarySeq objects only.
Florent for your purposes you need to just set the type in your scripts.
And in the future - you must run tests before committing something,
especially one that touches so many files!
I think I've unrolled most of the defaults while keeping the new type
method intact.
There are still warnings and few failed tests to followup.
Okay - I am really now in favor of breaking the modules into packages
test running taking forever and this is way unwieldy. I am looking
forward to seeing things in smaller pieces somehow.
-jason
Robert Buels wrote:
> OK, well are you guys going to roll them back and do a branch? Broken
> trunk is not good.
>
> R
>
> Jason Stajich wrote:
>> hmm yes. Let's roll back those Florent.
>>
>> I guess we need to also discuss the concept of branches again here
>> for major changes like this. The speed fix is helpful for short read
>> but probably not for all systems so just having the ability to
>> provide the right type when you use your scripts focused on
>> short-reads might be the best way to go forward and preserve
>> backwards compatibility.
>>
>> -jason
>>
>> Chris Fields wrote:
>>> Yes, except that tests are now failing on live. Bio::PrimarySeq
>>> isn't a Bio::SeqI, so it's a bit of an API change in a lot of places
>>> expecting a Bio::SeqI. We should make this optional for the time
>>> being, until we discuss it more on list.
>>>
>>> chris
>>>
>>> On Feb 16, 2010, at 10:57 AM, Jason Stajich wrote:
>>>
>>>> Great! thanks for committing it.
>>>>
>>>> Florent Angly wrote:
>>>>> Thanks for your reply Jason. It's very valuable since you coded
>>>>> the modules in question!
>>>>>
>>>>> Now that I understand the purpose of these modules, I did a little
>>>>> bit of documentation clarification. I also figured out that I
>>>>> should implement the 'type' method in
>>>>> Bio::Seq::SeqFastaSpeedFactory and made it support creating
>>>>> Bio::PrimarySeq. This is all committed to SVN.
>>>>>
>>>>> When sequences are large, you are right that it does not make much
>>>>> difference in time or memory resource to use one sequence type
>>>>> over the other. However, that's different for short sequences. I
>>>>> took a FASTA file weighting 39.3 MB and containing 360,000
>>>>> sequences of 106 bp on average.
>>>>>
>>>>> CASE 1:
>>>>> Requested factory Bio::Seq::SeqFactory
>>>>> Requested sequence type is Bio::Seq
>>>>> Memory used: 329 MB
>>>>> Time spent: 1m8.522s
>>>>>
>>>>> CASE 2:
>>>>> Requested factory Bio::Seq::SeqFactory
>>>>> Requested sequence type is Bio::PrimarySeq
>>>>> Memory used: 246 MB
>>>>> Time spent: 0m52.663s
>>>>>
>>>>> CASE 3:
>>>>> Requested factory Bio::Seq::SeqFastaSpeedFactory
>>>>> Requested sequence type is Bio::Seq
>>>>> Memory used: 298 MB
>>>>> Time spent: 0m31.292s
>>>>>
>>>>> CASE 4:
>>>>> Requested factory Bio::Seq::SeqFastaSpeedFactory
>>>>> Requested sequence type is Bio::PrimarySeq
>>>>> Memory used: 231 MB
>>>>> Time spent: 0m28.500s
>>>>>
>>>>> Clearly, when only simple sequence attributes are needed or FASTA
>>>>> files are used, it's fast and memory efficient to use the
>>>>> SeqFastaSpeedFactory with PrimarySeq.
>>>>>
>>>>> Florent
>>>>>
>>>>>
>>>>> On 16/02/10 10:03, Jason Stajich wrote:
>>>>>> I don't think that aspect of the documentation vs interface was
>>>>>> ever implemented - the interface object doesn't specify a type
>>>>>> method or init argument even though the documentation says so.
>>>>>> Not really sure why not, this was ages ago unfortunately.
>>>>>>
>>>>>> This particular factory definitely assumes you are building
>>>>>> Bio::Seq objects - you can try and subclass and build your own to
>>>>>> see if it makes much of a difference in speed/memory - I would
>>>>>> posit it won't make a significant difference but be interested to
>>>>>> see what you find.
>>>>>>
>>>>>> Just make your own Bio::Seq::PrimarySeqFactory object based on
>>>>>> Bio::Seq::FastaSpeedFactory and simplify that code so it doesn't
>>>>>> build Bio::Seq object wrapper around the Bio::PrimarySeq and do
>>>>>> some perf tests so we'll know if it makes any difference here.
>>>>>>
>>>>>> -jason
>>>>>> Florent Angly wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I am trying to reduce memory usage and speedup reading FASTA
>>>>>>> files using the facilities provided by BioPerl.
>>>>>>>
>>>>>>> The first thing I noticed is that when using Bio::SeqIO::fasta,
>>>>>>> the objects returned are Bio::Seq, not Bio::PrimarySeq objects.
>>>>>>> Bio::PrimarySeq sequences are lighter than Bio::Seq sequences,
>>>>>>> so it's what I need. See code below:
>>>>>>>> use warnings;
>>>>>>>> use strict;
>>>>>>>> use Data::Dumper;
>>>>>>>> use Bio::SeqIO;
>>>>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>>>>>> my $seqfactory = $in->sequence_factory; #
>>>>>>>> Bio::Factory::ObjectBuilderI
>>>>>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>>>>>> my $seq = $in->next_seq;
>>>>>>>> print Dumper($seq);
>>>>>>>> __END__
>>>>>>>>> seq1 a small test sequence q
>>>>>>>> ACGTACGACTACGACTAGCGCCATCAGC
>>>>>>> It returns:
>>>>>>>> $VAR1 = bless( {
>>>>>>>> 'primary_id' => 'seq1',
>>>>>>>> 'primary_seq' => bless( {
>>>>>>>> 'display_id' =>
>>>>>>>> 'seq1',
>>>>>>>> 'primary_id' =>
>>>>>>>> 'seq1',
>>>>>>>> 'desc' => 'a small
>>>>>>>> test sequence',
>>>>>>>> 'seq' =>
>>>>>>>> 'ACGTACGACTACGACTAGCGCCATCAGC',
>>>>>>>> 'alphabet' => 'dna'
>>>>>>>> }, 'Bio::PrimarySeq' )
>>>>>>>> }, 'Bio::Seq' );
>>>>>>> Actually, we have a Bio::Seq containing a Bio::PrimarySeq. I
>>>>>>> really only need the Bio::PrimarySeq. Looking at the
>>>>>>> documentation for Bio::SeqIO I found that I could in theory
>>>>>>> adjust the sequence factory and sequence builder to my liking...
>>>>>>> I tried this:
>>>>>>>> use warnings;
>>>>>>>> use strict;
>>>>>>>> use Data::Dumper;
>>>>>>>> use Bio::SeqIO;
>>>>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>>>>>> my $seqfactory = $in->sequence_factory; #
>>>>>>>> Bio::Factory::ObjectBuilderI
>>>>>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>>>>>> my $seq = $in->next_seq;
>>>>>>>> print Dumper($seq);
>>>>>>>> __END__
>>>>>>>>> seq1 a small test sequence
>>>>>>>> ACGTACGACTACGACTAGCGCCATCAGC
>>>>>>> This returns:
>>>>>>>> The factory is a Bio::Seq::SeqFastaSpeedFactory
>>>>>>>> Can't locate object method "type" via package
>>>>>>>> "Bio::Seq::SeqFastaSpeedFactory" at ./seqbuilder_test_3.pl line
>>>>>>>> 12,<DATA> line 1.
>>>>>>> According to Bio::Seq::FastaSpeedFactory's documentation:
>>>>>>>> If you want the factory to create Bio::Seq objects instead
>>>>>>>> of the default Bio::PrimarySeq objects, use the -type parameter
>>>>>>> So, PrimarySeq should be the default type, even though in my
>>>>>>> case it seems not to be. Second, I can't seem to use the -type
>>>>>>> method to change what the return type is... It errors.
>>>>>>>
>>>>>>> Any ideas??? Thanks,
>>>>>>>
>>>>>>> Florent
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioperl-l mailing list
>>>>>>> Bioperl-l at lists.open-bio.org
>>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list