[Bioperl-l] Parsing FASTA files into PrimarySeq objects
Hilmar Lapp
hlapp at drycafe.net
Wed Feb 17 22:48:00 EST 2010
Thanks Jason - your help is hugely appreciated.
-hilmar
Sent from away
On Feb 17, 2010, at 6:22 PM, Jason Stajich <jason at bioperl.org> wrote:
> I rolled back.
> The fix seems to just be to make the default Bio::Seq not
> Bio::PrimarySeq in terms of what the factory return and undo some of
> the explicit expectations in the code. I think this speedup is a
> special case and Bio::Seq still needs to be the default but you
> override it when you want Bio::PrimarySeq objects only.
>
> Florent for your purposes you need to just set the type in your
> scripts.
>
> And in the future - you must run tests before committing something,
> especially one that touches so many files!
>
> I think I've unrolled most of the defaults while keeping the new
> type method intact.
>
> There are still warnings and few failed tests to followup.
>
> Okay - I am really now in favor of breaking the modules into
> packages test running taking forever and this is way unwieldy. I am
> looking forward to seeing things in smaller pieces somehow.
>
> -jason
> Robert Buels wrote:
>> OK, well are you guys going to roll them back and do a branch?
>> Broken trunk is not good.
>>
>> R
>>
>> Jason Stajich wrote:
>>> hmm yes. Let's roll back those Florent.
>>>
>>> I guess we need to also discuss the concept of branches again here
>>> for major changes like this. The speed fix is helpful for short
>>> read but probably not for all systems so just having the ability
>>> to provide the right type when you use your scripts focused on
>>> short-reads might be the best way to go forward and preserve
>>> backwards compatibility.
>>>
>>> -jason
>>>
>>> Chris Fields wrote:
>>>> Yes, except that tests are now failing on live. Bio::PrimarySeq
>>>> isn't a Bio::SeqI, so it's a bit of an API change in a lot of
>>>> places expecting a Bio::SeqI. We should make this optional for
>>>> the time being, until we discuss it more on list.
>>>>
>>>> chris
>>>>
>>>> On Feb 16, 2010, at 10:57 AM, Jason Stajich wrote:
>>>>
>>>>> Great! thanks for committing it.
>>>>>
>>>>> Florent Angly wrote:
>>>>>> Thanks for your reply Jason. It's very valuable since you coded
>>>>>> the modules in question!
>>>>>>
>>>>>> Now that I understand the purpose of these modules, I did a
>>>>>> little bit of documentation clarification. I also figured out
>>>>>> that I should implement the 'type' method in
>>>>>> Bio::Seq::SeqFastaSpeedFactory and made it support creating
>>>>>> Bio::PrimarySeq. This is all committed to SVN.
>>>>>>
>>>>>> When sequences are large, you are right that it does not make
>>>>>> much difference in time or memory resource to use one sequence
>>>>>> type over the other. However, that's different for short
>>>>>> sequences. I took a FASTA file weighting 39.3 MB and containing
>>>>>> 360,000 sequences of 106 bp on average.
>>>>>>
>>>>>> CASE 1:
>>>>>> Requested factory Bio::Seq::SeqFactory
>>>>>> Requested sequence type is Bio::Seq
>>>>>> Memory used: 329 MB
>>>>>> Time spent: 1m8.522s
>>>>>>
>>>>>> CASE 2:
>>>>>> Requested factory Bio::Seq::SeqFactory
>>>>>> Requested sequence type is Bio::PrimarySeq
>>>>>> Memory used: 246 MB
>>>>>> Time spent: 0m52.663s
>>>>>>
>>>>>> CASE 3:
>>>>>> Requested factory Bio::Seq::SeqFastaSpeedFactory
>>>>>> Requested sequence type is Bio::Seq
>>>>>> Memory used: 298 MB
>>>>>> Time spent: 0m31.292s
>>>>>>
>>>>>> CASE 4:
>>>>>> Requested factory Bio::Seq::SeqFastaSpeedFactory
>>>>>> Requested sequence type is Bio::PrimarySeq
>>>>>> Memory used: 231 MB
>>>>>> Time spent: 0m28.500s
>>>>>>
>>>>>> Clearly, when only simple sequence attributes are needed or
>>>>>> FASTA files are used, it's fast and memory efficient to use the
>>>>>> SeqFastaSpeedFactory with PrimarySeq.
>>>>>>
>>>>>> Florent
>>>>>>
>>>>>>
>>>>>> On 16/02/10 10:03, Jason Stajich wrote:
>>>>>>> I don't think that aspect of the documentation vs interface
>>>>>>> was ever implemented - the interface object doesn't specify a
>>>>>>> type method or init argument even though the documentation
>>>>>>> says so. Not really sure why not, this was ages ago
>>>>>>> unfortunately.
>>>>>>>
>>>>>>> This particular factory definitely assumes you are building
>>>>>>> Bio::Seq objects - you can try and subclass and build your own
>>>>>>> to see if it makes much of a difference in speed/memory - I
>>>>>>> would posit it won't make a significant difference but be
>>>>>>> interested to see what you find.
>>>>>>>
>>>>>>> Just make your own Bio::Seq::PrimarySeqFactory object based on
>>>>>>> Bio::Seq::FastaSpeedFactory and simplify that code so it
>>>>>>> doesn't build Bio::Seq object wrapper around the
>>>>>>> Bio::PrimarySeq and do some perf tests so we'll know if it
>>>>>>> makes any difference here.
>>>>>>>
>>>>>>> -jason
>>>>>>> Florent Angly wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I am trying to reduce memory usage and speedup reading FASTA
>>>>>>>> files using the facilities provided by BioPerl.
>>>>>>>>
>>>>>>>> The first thing I noticed is that when using
>>>>>>>> Bio::SeqIO::fasta, the objects returned are Bio::Seq, not
>>>>>>>> Bio::PrimarySeq objects. Bio::PrimarySeq sequences are
>>>>>>>> lighter than Bio::Seq sequences, so it's what I need. See
>>>>>>>> code below:
>>>>>>>>> use warnings;
>>>>>>>>> use strict;
>>>>>>>>> use Data::Dumper;
>>>>>>>>> use Bio::SeqIO;
>>>>>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>>>>>>> my $seqfactory = $in->sequence_factory; #
>>>>>>>>> Bio::Factory::ObjectBuilderI
>>>>>>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>>>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>>>>>>> my $seq = $in->next_seq;
>>>>>>>>> print Dumper($seq);
>>>>>>>>> __END__
>>>>>>>>>> seq1 a small test sequence q
>>>>>>>>> ACGTACGACTACGACTAGCGCCATCAGC
>>>>>>>> It returns:
>>>>>>>>> $VAR1 = bless( {
>>>>>>>>> 'primary_id' => 'seq1',
>>>>>>>>> 'primary_seq' => bless( {
>>>>>>>>> 'display_id' =>
>>>>>>>>> 'seq1',
>>>>>>>>> 'primary_id' =>
>>>>>>>>> 'seq1',
>>>>>>>>> 'desc' => 'a
>>>>>>>>> small test sequence',
>>>>>>>>> 'seq' =>
>>>>>>>>> 'ACGTACGACTACGACTAGCGCCATCAGC',
>>>>>>>>> 'alphabet' => 'dna'
>>>>>>>>> }, 'Bio::PrimarySeq' )
>>>>>>>>> }, 'Bio::Seq' );
>>>>>>>> Actually, we have a Bio::Seq containing a Bio::PrimarySeq. I
>>>>>>>> really only need the Bio::PrimarySeq. Looking at the
>>>>>>>> documentation for Bio::SeqIO I found that I could in theory
>>>>>>>> adjust the sequence factory and sequence builder to my
>>>>>>>> liking... I tried this:
>>>>>>>>> use warnings;
>>>>>>>>> use strict;
>>>>>>>>> use Data::Dumper;
>>>>>>>>> use Bio::SeqIO;
>>>>>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>>>>>>> my $seqfactory = $in->sequence_factory; #
>>>>>>>>> Bio::Factory::ObjectBuilderI
>>>>>>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>>>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>>>>>>> my $seq = $in->next_seq;
>>>>>>>>> print Dumper($seq);
>>>>>>>>> __END__
>>>>>>>>>> seq1 a small test sequence
>>>>>>>>> ACGTACGACTACGACTAGCGCCATCAGC
>>>>>>>> This returns:
>>>>>>>>> The factory is a Bio::Seq::SeqFastaSpeedFactory
>>>>>>>>> Can't locate object method "type" via package
>>>>>>>>> "Bio::Seq::SeqFastaSpeedFactory" at ./seqbuilder_test_3.pl
>>>>>>>>> line 12,<DATA> line 1.
>>>>>>>> According to Bio::Seq::FastaSpeedFactory's documentation:
>>>>>>>>> If you want the factory to create Bio::Seq objects instead
>>>>>>>>> of the default Bio::PrimarySeq objects, use the -type
>>>>>>>>> parameter
>>>>>>>> So, PrimarySeq should be the default type, even though in my
>>>>>>>> case it seems not to be. Second, I can't seem to use the -
>>>>>>>> type method to change what the return type is... It errors.
>>>>>>>>
>>>>>>>> Any ideas??? Thanks,
>>>>>>>>
>>>>>>>> Florent
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioperl-l mailing list
>>>>>>>> Bioperl-l at lists.open-bio.org
>>>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list