[Bioperl-l] Parsing FASTA files into PrimarySeq objects
Florent Angly
florent.angly at gmail.com
Thu Feb 18 00:23:47 EST 2010
Sorry guys, I was out of internet range until now.
My intentions were not to break backward compatibility in my own
interest. I did run the tests that were directly relevant to my change.
My bad for not running _all_ BioPerl tests. I obviously underestimated
the scope of my changes and I should have run all tests...
Florent
PS/ Thank god there is SVN -_-'
On 18/02/10 13:48, Hilmar Lapp wrote:
> Thanks Jason - your help is hugely appreciated.
>
> -hilmar
>
> Sent from away
>
> On Feb 17, 2010, at 6:22 PM, Jason Stajich <jason at bioperl.org> wrote:
>
>> I rolled back.
>> The fix seems to just be to make the default Bio::Seq not
>> Bio::PrimarySeq in terms of what the factory return and undo some of
>> the explicit expectations in the code. I think this speedup is a
>> special case and Bio::Seq still needs to be the default but you
>> override it when you want Bio::PrimarySeq objects only.
>>
>> Florent for your purposes you need to just set the type in your scripts.
>>
>> And in the future - you must run tests before committing something,
>> especially one that touches so many files!
>>
>> I think I've unrolled most of the defaults while keeping the new type
>> method intact.
>>
>> There are still warnings and few failed tests to followup.
>>
>> Okay - I am really now in favor of breaking the modules into packages
>> test running taking forever and this is way unwieldy. I am looking
>> forward to seeing things in smaller pieces somehow.
>>
>> -jason
>> Robert Buels wrote:
>>> OK, well are you guys going to roll them back and do a branch?
>>> Broken trunk is not good.
>>>
>>> R
>>>
>>> Jason Stajich wrote:
>>>> hmm yes. Let's roll back those Florent.
>>>>
>>>> I guess we need to also discuss the concept of branches again here
>>>> for major changes like this. The speed fix is helpful for short
>>>> read but probably not for all systems so just having the ability to
>>>> provide the right type when you use your scripts focused on
>>>> short-reads might be the best way to go forward and preserve
>>>> backwards compatibility.
>>>>
>>>> -jason
>>>>
>>>> Chris Fields wrote:
>>>>> Yes, except that tests are now failing on live. Bio::PrimarySeq
>>>>> isn't a Bio::SeqI, so it's a bit of an API change in a lot of
>>>>> places expecting a Bio::SeqI. We should make this optional for
>>>>> the time being, until we discuss it more on list.
>>>>>
>>>>> chris
>>>>>
>>>>> On Feb 16, 2010, at 10:57 AM, Jason Stajich wrote:
>>>>>
>>>>>> Great! thanks for committing it.
>>>>>>
>>>>>> Florent Angly wrote:
>>>>>>> Thanks for your reply Jason. It's very valuable since you coded
>>>>>>> the modules in question!
>>>>>>>
>>>>>>> Now that I understand the purpose of these modules, I did a
>>>>>>> little bit of documentation clarification. I also figured out
>>>>>>> that I should implement the 'type' method in
>>>>>>> Bio::Seq::SeqFastaSpeedFactory and made it support creating
>>>>>>> Bio::PrimarySeq. This is all committed to SVN.
>>>>>>>
>>>>>>> When sequences are large, you are right that it does not make
>>>>>>> much difference in time or memory resource to use one sequence
>>>>>>> type over the other. However, that's different for short
>>>>>>> sequences. I took a FASTA file weighting 39.3 MB and containing
>>>>>>> 360,000 sequences of 106 bp on average.
>>>>>>>
>>>>>>> CASE 1:
>>>>>>> Requested factory Bio::Seq::SeqFactory
>>>>>>> Requested sequence type is Bio::Seq
>>>>>>> Memory used: 329 MB
>>>>>>> Time spent: 1m8.522s
>>>>>>>
>>>>>>> CASE 2:
>>>>>>> Requested factory Bio::Seq::SeqFactory
>>>>>>> Requested sequence type is Bio::PrimarySeq
>>>>>>> Memory used: 246 MB
>>>>>>> Time spent: 0m52.663s
>>>>>>>
>>>>>>> CASE 3:
>>>>>>> Requested factory Bio::Seq::SeqFastaSpeedFactory
>>>>>>> Requested sequence type is Bio::Seq
>>>>>>> Memory used: 298 MB
>>>>>>> Time spent: 0m31.292s
>>>>>>>
>>>>>>> CASE 4:
>>>>>>> Requested factory Bio::Seq::SeqFastaSpeedFactory
>>>>>>> Requested sequence type is Bio::PrimarySeq
>>>>>>> Memory used: 231 MB
>>>>>>> Time spent: 0m28.500s
>>>>>>>
>>>>>>> Clearly, when only simple sequence attributes are needed or
>>>>>>> FASTA files are used, it's fast and memory efficient to use the
>>>>>>> SeqFastaSpeedFactory with PrimarySeq.
>>>>>>>
>>>>>>> Florent
>>>>>>>
>>>>>>>
>>>>>>> On 16/02/10 10:03, Jason Stajich wrote:
>>>>>>>> I don't think that aspect of the documentation vs interface was
>>>>>>>> ever implemented - the interface object doesn't specify a type
>>>>>>>> method or init argument even though the documentation says so.
>>>>>>>> Not really sure why not, this was ages ago unfortunately.
>>>>>>>>
>>>>>>>> This particular factory definitely assumes you are building
>>>>>>>> Bio::Seq objects - you can try and subclass and build your own
>>>>>>>> to see if it makes much of a difference in speed/memory - I
>>>>>>>> would posit it won't make a significant difference but be
>>>>>>>> interested to see what you find.
>>>>>>>>
>>>>>>>> Just make your own Bio::Seq::PrimarySeqFactory object based on
>>>>>>>> Bio::Seq::FastaSpeedFactory and simplify that code so it
>>>>>>>> doesn't build Bio::Seq object wrapper around the
>>>>>>>> Bio::PrimarySeq and do some perf tests so we'll know if it
>>>>>>>> makes any difference here.
>>>>>>>>
>>>>>>>> -jason
>>>>>>>> Florent Angly wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I am trying to reduce memory usage and speedup reading FASTA
>>>>>>>>> files using the facilities provided by BioPerl.
>>>>>>>>>
>>>>>>>>> The first thing I noticed is that when using
>>>>>>>>> Bio::SeqIO::fasta, the objects returned are Bio::Seq, not
>>>>>>>>> Bio::PrimarySeq objects. Bio::PrimarySeq sequences are lighter
>>>>>>>>> than Bio::Seq sequences, so it's what I need. See code below:
>>>>>>>>>> use warnings;
>>>>>>>>>> use strict;
>>>>>>>>>> use Data::Dumper;
>>>>>>>>>> use Bio::SeqIO;
>>>>>>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>>>>>>>> my $seqfactory = $in->sequence_factory; #
>>>>>>>>>> Bio::Factory::ObjectBuilderI
>>>>>>>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>>>>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>>>>>>>> my $seq = $in->next_seq;
>>>>>>>>>> print Dumper($seq);
>>>>>>>>>> __END__
>>>>>>>>>>> seq1 a small test sequence q
>>>>>>>>>> ACGTACGACTACGACTAGCGCCATCAGC
>>>>>>>>> It returns:
>>>>>>>>>> $VAR1 = bless( {
>>>>>>>>>> 'primary_id' => 'seq1',
>>>>>>>>>> 'primary_seq' => bless( {
>>>>>>>>>> 'display_id' =>
>>>>>>>>>> 'seq1',
>>>>>>>>>> 'primary_id' =>
>>>>>>>>>> 'seq1',
>>>>>>>>>> 'desc' => 'a small
>>>>>>>>>> test sequence',
>>>>>>>>>> 'seq' =>
>>>>>>>>>> 'ACGTACGACTACGACTAGCGCCATCAGC',
>>>>>>>>>> 'alphabet' => 'dna'
>>>>>>>>>> }, 'Bio::PrimarySeq' )
>>>>>>>>>> }, 'Bio::Seq' );
>>>>>>>>> Actually, we have a Bio::Seq containing a Bio::PrimarySeq. I
>>>>>>>>> really only need the Bio::PrimarySeq. Looking at the
>>>>>>>>> documentation for Bio::SeqIO I found that I could in theory
>>>>>>>>> adjust the sequence factory and sequence builder to my
>>>>>>>>> liking... I tried this:
>>>>>>>>>> use warnings;
>>>>>>>>>> use strict;
>>>>>>>>>> use Data::Dumper;
>>>>>>>>>> use Bio::SeqIO;
>>>>>>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>>>>>>>> my $seqfactory = $in->sequence_factory; #
>>>>>>>>>> Bio::Factory::ObjectBuilderI
>>>>>>>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>>>>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>>>>>>>> my $seq = $in->next_seq;
>>>>>>>>>> print Dumper($seq);
>>>>>>>>>> __END__
>>>>>>>>>>> seq1 a small test sequence
>>>>>>>>>> ACGTACGACTACGACTAGCGCCATCAGC
>>>>>>>>> This returns:
>>>>>>>>>> The factory is a Bio::Seq::SeqFastaSpeedFactory
>>>>>>>>>> Can't locate object method "type" via package
>>>>>>>>>> "Bio::Seq::SeqFastaSpeedFactory" at ./seqbuilder_test_3.pl
>>>>>>>>>> line 12,<DATA> line 1.
>>>>>>>>> According to Bio::Seq::FastaSpeedFactory's documentation:
>>>>>>>>>> If you want the factory to create Bio::Seq objects instead
>>>>>>>>>> of the default Bio::PrimarySeq objects, use the -type parameter
>>>>>>>>> So, PrimarySeq should be the default type, even though in my
>>>>>>>>> case it seems not to be. Second, I can't seem to use the -type
>>>>>>>>> method to change what the return type is... It errors.
>>>>>>>>>
>>>>>>>>> Any ideas??? Thanks,
>>>>>>>>>
>>>>>>>>> Florent
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioperl-l mailing list
>>>>>>>>> Bioperl-l at lists.open-bio.org
>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>> _______________________________________________
>>>>>> Bioperl-l mailing list
>>>>>> Bioperl-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list