[MOBY-dev] XML parsing takes forever
Kenny Billiau
kebil at psb.ugent.be
Wed Nov 5 10:57:00 EST 2008
Hi,
after some more tests, I've succesfully reduced the parsing time from over
1 hour 15 minutes to 17 seconds. Which is not ideal yet, but at least workable :)
This is the profiling result for the original environment. The test xml file is the original file I've sent here without the sequence.
[davebowman at hal2000 ]$ /usr/bin/perl -d:DProf mobytest.pl mobysav_no_dna.xml
Trying to parse.. Done.
[davebowman at hal2000 ]$ dprofpp
Total Elapsed Time = 4348.764 Seconds
User+System Time = 4344.374 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
97.3 4227. 4227.2 399336 0.0011 0.0011 IO::Scalar::print
0.88 38.06 4357.4 189130 0.0000 0.0023 Dumpvalue::unwrap
0.80 34.54 51.532 369048 0.0000 0.0000 Dumpvalue::stringify
0.54 23.63 23.634 463568 0.0000 0.0000 File::Spec::Unix::canonpath
0.33 14.23 4356.7 188364 0.0000 0.0023 Dumpvalue::DumpElem
First I simply removed the overload operator "" (as_string). I still get the same results and processing time is reduced drastically.
I don't really know why this way of stringifying (complex) perl types is used here.
Is there a special reason for it? Can't we simply use Data::Dumper (which prints to a $var by default)?
[davebowman at hal2000 ]$ dprofpp
Total Elapsed Time = 35.85114 Seconds
User+System Time = 33.54114 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
64.5 21.64 21.646 463568 0.0000 0.0000 File::Spec::Unix::canonpath
22.9 7.698 31.602 231569 0.0000 0.0000 File::Spec::Unix::catfile
19.7 6.607 6.607 2144 0.0031 0.0031 MOSES::MOBY::Cache::Registries::init
18.6 6.251 37.772 2143 0.0029 0.0176 MOSES::MOBY::Generators::Utils::find_file
6.80 2.282 13.448 231569 0.0000 0.0000 File::Spec::Unix::catdir
A lot of calls are made to MOSES::MOBY::Generators::Utils::find_file,
which in it's turn does a lot of calls with File::Spec. So I simply cache the results.
Time is reduced to 17 seconds:
Total Elapsed Time = 17.34032 Seconds
User+System Time = 13.20032 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
51.5 6.807 6.807 2144 0.0032 0.0032 MOSES::MOBY::Cache::Registries::init
9.01 1.190 2.005 162858 0.0000 0.0000 MOSES::MOBY::Base::__ANON__
4.65 0.614 0.614 310751 0.0000 0.0000 XML::LibXML::Node::nodeName
4.64 0.612 1.655 32171 0.0000 0.0001 XML::LibXML::Element::getChildrenByTagName
4.48 0.592 0.592 5515 0.0001 0.0001 MOSES::MOBY::Cache::Central::_clean
More then half of the time still goes to building up the MOBY environment for each processed element in the XML file. Isn't this a bit overkill? ;)
Is there a way to make a MOSES::MOBY factory that can handle the building up of the moby environment just once in the same (mod_perl) call? Or is there a reason this has been avoided?
I've included the patches for the result caching in
MOSES::MOBY::Generators::Utils::find_file . there's not much to it. My
version of MOSES::MOBY still is the one uploaded to cpan: Module MOSES::MOBY (E/EK/EKAWAS/MOSES-MOBY-0.86.tar.gz)
happy hacking,
Kenny
On Tue, 4 Nov 2008, Kenny Billiau wrote:
> Hi,
>
> I've run a little testprogram to see where most of the time is spent. I use a
> stripped down version of the xml file I've sent you. All but a few
> ElementContainers and no sequence. Just to keep the processing time low
> enough to debug.
>
> use MOSES::MOBY::Parser;
>
> open(FILE, '<', $ARGV[0]);
> my @lines = <FILE>;
> close FILE;
>
> my $parser = MOSES::MOBY::Parser->new();
> print $parser->parse( method => 'string', data => join(q{}, @lines) );
>
>
> [davebowman at hal2000 ~]$ /usr/bin/perl -d:DProf mobytest.pl mobysav_small.xml
> [davebowman at hal2000 ~]$ dprofpp
> Exporter::Heavy::heavy_export has 14 unstacked calls in outer
> Config::Simple::import_from has 1 unstacked calls in outer
> utf8::AUTOLOAD has -1 unstacked calls in outer
> Config::Simple:: verbose has 1 unstacked calls in outer
> Config::Simple:: vars has 1 unstacked calls in outer
> AutoLoader::AUTOLOAD has -3 unstacked calls in outer
> Fcntl:: AUTOLOAD has -6 unstacked calls in outer
> Fcntl: :__ANON__ has 6 unstacked calls in outer
> Exporter::export has -14 unstacked calls in outer
> utf8::SWASHNEW has 1 unstacked calls in outer
> Total Elapsed Time = 8.470220 Seconds
> User+System Time = 7.750220 Seconds
> Exclusive Times
> %Time ExclSec CumulS #Calls sec/call Csec/c Name
> 68.1 5.280 5.280 69736 0.0001 0.0001 IO::Scalar::print
> 6.94 0.538 0.783 64088 0.0000 0.0000 Dumpvalue::stringify
> 6.04 0.468 7.173 33755 0.0000 0.0002 Dumpvalue::unwrap
> 4.30 0.333 0.333 47091 0.0000 0.0000 File::Spec::Unix::canonpath
> 2.43 0.188 7.157 33006 0.0000 0.0002 Dumpvalue::DumpElem
> 2.13 0.165 0.165 59776 0.0000 0.0000 overload::mycan
> 1.78 0.138 0.224 15990 0.0000 0.0000 MOSES::MOBY::Base::__ANON__
> 1.42 0.110 0.708 539 0.0002 0.0013
> MOSES::MOBY::Cache::Central::_createDataTypeFromXML
> 1.39 0.108 0.108 142023 0.0000 0.0000 UNIVERSAL::isa
> 1.29 0.100 0.100 208 0.0005 0.0005
> MOSES::MOBY::Cache::Registries::init
> 1.20 0.093 0.357 17359 0.0000 0.0000 overload::OverloadedStringify
> 1.16 0.090 0.090 30407 0.0000 0.0000 XML::LibXML::Node::nodeName
> 1.02 0.079 0.079 539 0.0001 0.0001
> MOSES::MOBY::Cache::Central::_clean
> 0.99 0.077 0.418 207 0.0004 0.0020
> MOSES::MOBY::Generators::Utils::find_file
> 0.94 0.073 0.203 3147 0.0000 0.0001
> XML::LibXML::Element::getChildrenByTagName
>
> It's fairly obivous now ;)
>
> If I remove the overload operator in MOSES::MOBY::Base, time is reduced
> 15fold! But for the 250KB file I've sent, it still takes us a minute to
> parse.
>
> If I simply parse the file with XML::LibXML and no callbacks or custom
> handlers, it's parsed instantly:
>
> [davebowman at hal2000 ~]$ /usr/bin/perl -d:DProf mobytest.pl mobysav_small.xml
> [kebil at psbdev01 ~]$ dprofpp
> Total Elapsed Time = 0.079034 Seconds
> User+System Time = 0.079034 Seconds
>
>
> -Kenny, slowly getting to the bottom of this ;)
>
>
> On Thu, 30 Oct 2008, Edward Kawas wrote:
>
>> Now I can confirm that it is slow parsing and not an infinite loop! The
>> XML
>> file took over an hour on my virtual machine to parse. I will see what I
>> can
>> do with other SAX parsers for Perl.
>>
>> Thanks,
>>
>> Eddie
>>
>> -----Original Message-----
>> From: moby-dev-bounces at lists.open-bio.org
>> [mailto:moby-dev-bounces at lists.open-bio.org] On Behalf Of José María
>> Fernández González
>> Sent: October-30-08 12:55 PM
>> To: Core developer announcements
>> Subject: Re: [MOBY-dev] XML parsing takes forever
>>
>> Hi everybody,
>> I'm using the XML::LibXML SAX parser for other projects and tasks,
>> and it can
>> be really, really slow. You don't realize that until you are parsing a
>> large
>> file from a pipe, and you see that most of the time the program which is
>> parsing is CPU (and not I/O) bounded.
>>
>> Best Regards,
>> José María
>>
>> Edward Kawas wrote:
>> > I think that you are right about the parser not going into an infinite
>> loop.
>> > For some reason, the SAX parser seems to be really slow (although, I
>> > don't
>> > want to say that until I know for sure).
>> >
>> > Eddie
>> >
>> > -----Original Message-----
>> > From: moby-dev-bounces at lists.open-bio.org
>> > [mailto:moby-dev-bounces at lists.open-bio.org] On Behalf Of Kenny Billiau
>> > Sent: October-30-08 11:56 AM
>> > To: Core developer announcements
>> > Subject: Re: [MOBY-dev] XML parsing takes forever
>> >
>> > Hi,
>> >
>> > thx for helping to investigate the problem!
>> >
>> > I'm pretty sure the parser gets out of it's mess. When you truncate the
>> > XML file by removing all but a few ElementContainers, it takes awhile,
>> > but
>>
>> > it actually returns without error.
>> >
>> > And memory consuption doesn't seem to be a problem either, as mostly
>> > 'top'
>>
>> > reveals that the invoked webservice script uses a steady amount of
>> > memory.
>> >
>> > thx,
>> > Kenny
>> >
>> > On Thu, 30 Oct 2008, Edward Kawas wrote:
>> >
>> > > So this one might take some time... from what I can see, our parser
>> > > goes
>> > > into an infinite loop (although, I am not certain about that ...).
>> > >
>> > > I will play around with this XML file and get back to you.
>> > >
>> > > Thanks,
>> > >
>> > > Eddie
>> > >
>> > > -----Original Message-----
>> > > From: moby-dev-bounces at lists.open-bio.org
>> > > [mailto:moby-dev-bounces at lists.open-bio.org] On Behalf Of Kenny
>> > > Billiau
>> > > Sent: October-30-08 8:03 AM
>> > > To: Core developer announcements
>> > > Subject: Re: [MOBY-dev] XML parsing takes forever
>> > >
>> > > Hi,
>> > >
>> > > here you go,
>> > >
>> > > Kenny
>> > >
>> > > On Thu, 30 Oct 2008, Edward Kawas wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > This is the right list! Could you send me the XML for your second
>> > service?
>> > > > Thanks,
>> > > >
>> > > > Eddie
>> > > _______________________________________________
>> > > MOBY-dev mailing list
>> > > MOBY-dev at lists.open-bio.org
>> > > http://lists.open-bio.org/mailman/listinfo/moby-dev
>> > >
>> >
>>
>>
>
>
--
-------------- next part --------------
--- /usr/lib/perl5/site_perl/5.8.5/MOSES/MOBY/Generators/Utils.pm 2008-04-29 21:45:41.000000000 +0200
+++ ../modules/MOSES/MOBY/Generators/Utils.pm 2008-11-05 16:47:23.000000000 +0100
@@ -26,7 +26,7 @@
# find a file located somewhere in @INC
use MOSES::MOBY::Generators::Utils;
my $file = MOSES::MOBY::Generators::Utils->find_file ('resource.file');
-
+
=cut
=head1 DESCRIPTION
@@ -61,17 +61,29 @@
=cut
+my %full_path_of = ();
+
sub find_file {
my ($self, $default_start, @names) = @_;
my $fixed_part = File::Spec->catfile (@names);
+ return $full_path_of{ $fixed_part } if exists $full_path_of{ $fixed_part };
+
my $result = File::Spec->catfile ($default_start, $fixed_part);
- return $result if -e $result;
+ if (-e $result) {
+ $full_path_of{ $fixed_part } = $result;
+ return $result;
+ }
foreach my $idx (0 .. $#INC) {
- $result = File::Spec->catfile ($INC[$idx], $fixed_part);
- return $result if -e $result;
+ $result = File::Spec->catfile ($INC[$idx], $fixed_part);
+ if (-e $result) {
+ $full_path_of{ $fixed_part } = $result;
+ return $result;
+ }
}
- return File::Spec->catfile ($default_start, $fixed_part);
+ $result = File::Spec->catfile ($default_start, $fixed_part);
+ $full_path_of{ $fixed_part } = $result;
+ return $result;
}
1;
More information about the MOBY-dev
mailing list