[Bioperl-l] SeqIO profiling

Jason Stajich jason@cgt.mc.duke.edu
Sun, 22 Sep 2002 16:18:36 -0400 (EDT)


On Thu, 19 Sep 2002, Hilmar Lapp wrote:

> Another time-waster may be that the fasta-parser actually
> instantiates Bio::Seq objects. This one would create an empty
> Bio::Annotation::Collection. Also, Bio::Seq internally has a
> Bio::PrimarySeq object that it needs to create.
>
> If you're not going to annotate the sequences you should be better
> off by passing the $seqio a SeqFactory that will create
> Bio::PrimarySeq objects:
>
>      my $fact = Bio::Seq::SeqFactory->new(-type => "Bio::PrimarySeq");
>      $seqio->sequence_factory($fact);
>      # now loop over $seqio->next_seq()
>
This is definitely good practice -- but I think we need to move
documentation of this feature up into the POD more prominently (Bio::SeqIO
I guess) as it can throw people for a loop if you read in fasta and try
and write the same sequence to the embl writer which will complain (but
not bail) if it is not getting SeqI objects.  I chose to make Bio::Seq the
default to avoid this behavior and power-users who want the speedup need
to enable the above.  Incidently, event based parsing will surely speed-up
things as well when we can tell a genbank/embl parser to skip huge
sections of a file, I see this perf boost in my blast parser when you tell
it you only want Hits and not HSPs too (set a different event listener,
but the parsing still proceeds across the whole data file).  So I do think
making a faster parser is less of a payoff than making object creation
faster.

> Alternatively, one could smart-ify the respective accessors in
> Bio::Seq to instantiate the objects only once they're requested and
> not yet set.
>
> What I listed only explains a factor of 3x, not 6x though ...

Additionally I just made the seqfeature collection instantiation lazy so
it is only built when requested in Bio::Seq.  Don't think we'll see much
improvement but would be good to test how much effect it had on fasta
format sequence parsing.

-jason



>
> -hilmar
>
> On Tuesday, September 17, 2002, at 05:44 AM, Aaron J Mackey wrote:
>
> >
> > With my just-committed patches to Root and RootI (BEGIN->INIT and some
> > small optimizations to _rearrange), things are marginally better
> > (previously this took around 30 seconds, and _rearrange took more
> > than 7
> > seconds of the time):
> >
> > Total Elapsed Time = 23.99110 Seconds
> >          User Time =    26.62 Seconds
> > Exclusive Times
> > %Time ExclSec CumulS #Calls sec/call Csec/c  Name
> >  16.0   4.270  4.270 119910   0.0000 0.0000
> > Bio::Root::RootI::_rearrange
> >  10.2   2.740  4.700 119910   0.0000 0.0000  Bio::Root::RootI::new
> >  8.08   2.150 10.480  47962   0.0000 0.0002  Bio::PrimarySeq::new
> >  7.59   2.020 25.150  23982   0.0001 0.0010
> > Bio::SeqIO::fasta::next_seq
> >  6.12   1.630 17.210  23990   0.0001 0.0007  Bio::Seq::BEGIN
> >  4.92   1.310  1.310 143892   0.0000 0.0000  Bio::Root::Root::verbose
> >  4.43   1.180  3.580  47962   0.0000 0.0001  Bio::PrimarySeq::seq
> >  3.08   0.820  0.820  95924   0.0000 0.0000  Bio::PrimarySeq::alphabet
> >  2.97   0.790  0.790 119910   0.0000 0.0000
> > Bio::Root::Root::_create_object
> >  2.89   0.770  3.630  23981   0.0000 0.0002
> > Bio::Annotation::Collection::new
> >  2.89   0.770  1.830  23981   0.0000 0.0001
> > Bio::Annotation::TypeManager::new
> >  2.85   0.760  1.490  23981   0.0000 0.0001
> > Bio::PrimarySeq::_guess_alphabet
> >  2.48   0.660  3.510  47962   0.0000 0.0001  Bio::SeqIO::alphabet
> >  2.37   0.630 18.190  23981   0.0000 0.0008
> > Bio::Seq::SeqFactory::create
> >  2.22   0.590  0.750  23982   0.0000 0.0000  Bio::Root::IO::_readline
> >
> > I still don't understand why we're getting 6 times as many
> > Bio::Root::RootI::new() calls (and 2 times as many PrimarySeq::new()
> > calls) as we are next_seq() calls ... ?!?!
> >
> > So that folks can play along at home, I've attached the testing script,
> > and you can download my test protein database (~5 Mb, compressed) at:
> > http://www.people.virginia.edu/~ajm6q/sp_protein.lseg.gz
> >
> > % perl -d:DProf testSeqIO.pl sp_protein.lseg
> > % dprofpp -u
> >
> > [ ... etc ... ]
> >
> > -Aaron
> >
> > On Tue, 17 Sep 2002, Tim Bunce wrote:
> >
> >> Aaron,
> >>
> >> Can you send me the script and anything else I'd need to be able
> >> to reproduce this data? Inclusing the data file. I'd like to take
> >> a look at it (performance tuning is a particular interest of mine).
> >>
> >> What version of bioperl were you using? 1.0.2, or from the repository?
> >>
> >> Tim.
> >>
> >> On Thu, Sep 12, 2002 at 03:40:38PM -0400, Aaron J Mackey wrote:
> >>>
> >>> here's some more fuel for the fire; this is a profiling run from a
> >>> super-simple script that invokes a SeqIO::fasta on a 24K sequence
> >>> protein
> >>> database, and does nothing else but call next_seq ... 24K calls to
> >>> next_seq, but 120K calls to _rearrange and BEGIN!!
> >>>
> >>> Total Elapsed Time = 30.84184 Seconds
> >>>          User Time = 30.51116 Seconds
> >>> Exclusive Times
> >>> %Time ExclSec CumulS #Calls sec/call Csec/c  Name
> >>>  23.0   7.040  6.980 119910   0.0001 0.0001
> >>> Bio::Root::RootI::_rearrange
> >>>  11.6   3.539  5.530 119914   0.0000 0.0000  Bio::Root::RootI::BEGIN
> >>>  9.96   3.039 30.811  23982   0.0001 0.0013
> >>> Bio::SeqIO::fasta::next_seq
> >>>  8.81   2.689 12.976  47969   0.0001 0.0003  Bio::PrimarySeq::BEGIN
> >>>  5.67   1.730  1.658 143892   0.0000 0.0000  Bio::Root::Root::verbose
> >>>  5.21   1.589 20.041  23981   0.0001 0.0008  Bio::Seq::new
> >>>  3.77   1.150  3.434  47962   0.0000 0.0001  Bio::PrimarySeq::seq
> >>>  3.02   0.920  5.134  47962   0.0000 0.0001  Bio::SeqIO::alphabet
> >>>  2.92   0.890  2.454  23981   0.0000 0.0001
> >>> Bio::Annotation::TypeManager::new
> >>>  2.88   0.880  4.514  23981   0.0000 0.0002
> >>> Bio::Annotation::Collection::new
> >>>  2.75   0.840  0.792  95924   0.0000 0.0000
> >>> Bio::PrimarySeq::alphabet
> >>>  2.72   0.830  0.850 119917   0.0000 0.0000  Bio::Root::Root::BEGIN
> >>>  2.65   0.810 21.299  23981   0.0000 0.0009
> >>> Bio::Seq::SeqFactory::create
> >>>  2.52   0.770  0.734  71943   0.0000 0.0000
> >>> Bio::Annotation::TypeManager::_add
> >>>                                              _type_map
> >>>  2.43   0.740  1.336  23981   0.0000 0.0001
> >>> Bio::PrimarySeq::_guess_alphabet
> >>>
> >>> here's what the call trace looks like, for each $io->next_seq call:
> >>>
> >>> Bio::SeqIO::fasta::next_seq
> >>>    Bio::Root::IO::_readline
> >>>       File::Temp::BEGIN
> >>>    Bio::SeqIO::alphabet
> >>>    Bio::SeqIO::sequence_factory
> >>>    Bio::Seq::SeqFactory::create
> >>>       Bio::Seq::SeqFactory::type
> >>>       Bio::Root::Root::verbose
> >>>       Bio::Seq::new
> >>>          Bio::Root::RootI::new
> >>>             Bio::Root::Root::BEGIN
> >>>             Bio::Root::Root::verbose
> >>>          Bio::PrimarySeq::BEGIN
> >>>             Bio::Root::RootI::new
> >>>                Bio::Root::Root::BEGIN
> >>>                Bio::Root::Root::verbose
> >>>             Bio::Root::RootI::_rearrange
> >>>             Bio::PrimarySeq::seq
> >>>                Bio::Root::RootI::_rearrange
> >>>                Bio::PrimarySeq::validate_seq
> >>>                Bio::PrimarySeq::alphabet
> >>>                Bio::PrimarySeq::_guess_alphabet
> >>>                   Bio::PrimarySeq::seq
> >>>                      Bio::Root::RootI::_rearrange
> >>>                   Bio::PrimarySeq::alphabet
> >>>             Bio::PrimarySeq::display_id
> >>>             Bio::PrimarySeq::primary_id
> >>>             Bio::PrimarySeq::desc
> >>>          Bio::Root::RootI::_rearrange
> >>>          Bio::Seq::primary_id
> >>>          Bio::Annotation::Collection::new
> >>>             Bio::Root::RootI::new
> >>>                Bio::Root::Root::BEGIN
> >>>                Bio::Root::Root::verbose
> >>>             Bio::Annotation::TypeManager::new
> >>>                Bio::Root::RootI::new
> >>>                   Bio::Root::Root::BEGIN
> >>>                   Bio::Root::Root::verbose
> >>>                Bio::Annotation::TypeManager::_add_type_map
> >>>                Bio::Annotation::TypeManager::_add_type_map
> >>>                Bio::Annotation::TypeManager::_add_type_map
> >>>             Bio::Annotation::Collection::_typemap
> >>>          Bio::Seq::annotation
> >>>          Bio::Seq::primary_seq
> >>>             UNIVERSAL::isa
> >>>             Bio::Seq::top_SeqFeatures
> >>>    Bio::Seq::alphabet
> >>>       Bio::Seq::primary_seq
> >>>       Bio::PrimarySeq::alphabet
> >>>    Bio::SeqIO::alphabet
> >>>       Bio::PrimarySeq::BEGIN
> >>>          Bio::Root::RootI::new
> >>>             Bio::Root::Root::BEGIN
> >>>             Bio::Root::Root::verbose
> >>>          Bio::Root::RootI::_rearrange
> >>>          Bio::PrimarySeq::alphabet
> >>>       Bio::Root::RootI::DESTROY
> >>>          Bio::Root::Root::_cleanup_methods
> >>> Bio::Seq::DESTROY
> >>> Bio::Root::RootI::DESTROY
> >>>    Bio::Root::Root::_cleanup_methods
> >>> Bio::Root::RootI::DESTROY
> >>>    Bio::Root::Root::_cleanup_methods
> >>> Bio::Root::RootI::DESTROY
> >>>    Bio::Root::Root::_cleanup_methods
> >>>
> >>>
> >>> There's a fair amount of handling in Bio::Root::Root BEGIN that
> >>> involves setting up Error usage; a $DONT_USE_ERROR in my test script
> >>> improves things a little (down to 25 seconds instead of 30), but
> >>> doesn't
> >>> change much of the profile percentages.
> >>>
> >>> -Aaron
> >>>
> >>>
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l@bioperl.org
> >>> http://bioperl.org/mailman/listinfo/bioperl-l
> >>
> >
> > --
> >  Aaron J Mackey
> >  Pearson Laboratory
> >  University of Virginia
> >  (434) 924-2821
> >  amackey@virginia.edu
> >
> >
> --
> -------------------------------------------------------------
> Hilmar Lapp                            email: lapp at gnf.org
> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> -------------------------------------------------------------
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>

-- 
Jason Stajich
Duke University
jason at cgt.mc.duke.edu