[MOBY-dev] XML parsing takes forever

Kenny Billiau kebil at psb.ugent.be
Wed Nov 5 10:57:00 EST 2008


Hi,

after some more tests, I've succesfully reduced the parsing time from over 
1 hour 15 minutes to 17 seconds. Which is not ideal yet, but at least workable :)

This is the profiling result for the original environment. The test xml file is the original file I've sent here without the sequence.

[davebowman at hal2000 ]$ /usr/bin/perl -d:DProf mobytest.pl mobysav_no_dna.xml
Trying to parse.. Done.
[davebowman at hal2000 ]$ dprofpp
Total Elapsed Time = 4348.764 Seconds
   User+System Time = 4344.374 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
  97.3   4227. 4227.2 399336   0.0011 0.0011  IO::Scalar::print
  0.88   38.06 4357.4 189130   0.0000 0.0023  Dumpvalue::unwrap
  0.80   34.54 51.532 369048   0.0000 0.0000  Dumpvalue::stringify
  0.54   23.63 23.634 463568   0.0000 0.0000  File::Spec::Unix::canonpath
  0.33   14.23 4356.7 188364   0.0000 0.0023  Dumpvalue::DumpElem


First I simply removed the overload operator "" (as_string). I still get the same results and processing time is reduced drastically.
I don't really know why this way of stringifying (complex) perl types is used here. 
Is there a special reason for it? Can't we simply use Data::Dumper (which prints to a $var by default)?

[davebowman at hal2000 ]$ dprofpp
Total Elapsed Time = 35.85114 Seconds
   User+System Time = 33.54114 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
  64.5   21.64 21.646 463568   0.0000 0.0000  File::Spec::Unix::canonpath
  22.9   7.698 31.602 231569   0.0000 0.0000  File::Spec::Unix::catfile
  19.7   6.607  6.607   2144   0.0031 0.0031  MOSES::MOBY::Cache::Registries::init
  18.6   6.251 37.772   2143   0.0029 0.0176  MOSES::MOBY::Generators::Utils::find_file
  6.80   2.282 13.448 231569   0.0000 0.0000  File::Spec::Unix::catdir

A lot of calls are made to MOSES::MOBY::Generators::Utils::find_file, 
which in it's turn does a lot of calls with File::Spec. So I simply cache the results.
Time is reduced to 17 seconds:

Total Elapsed Time = 17.34032 Seconds
   User+System Time = 13.20032 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
  51.5   6.807  6.807   2144   0.0032 0.0032  MOSES::MOBY::Cache::Registries::init
  9.01   1.190  2.005 162858   0.0000 0.0000  MOSES::MOBY::Base::__ANON__
  4.65   0.614  0.614 310751   0.0000 0.0000  XML::LibXML::Node::nodeName
  4.64   0.612  1.655  32171   0.0000 0.0001  XML::LibXML::Element::getChildrenByTagName
  4.48   0.592  0.592   5515   0.0001 0.0001  MOSES::MOBY::Cache::Central::_clean

More then half of the time still goes to building up the MOBY environment for each processed element in the XML file. Isn't this a bit overkill? ;)
Is there a way to make a MOSES::MOBY factory that can handle the building up of the moby environment just once in the same (mod_perl) call? Or is there a reason this has been avoided?

I've included the patches for the result caching in 
MOSES::MOBY::Generators::Utils::find_file . there's not much to it. My 
version of MOSES::MOBY still is the one uploaded to cpan: Module MOSES::MOBY (E/EK/EKAWAS/MOSES-MOBY-0.86.tar.gz)

happy hacking,
Kenny


On Tue, 4 Nov 2008, Kenny Billiau wrote:

> Hi,
>
> I've run a little testprogram to see where most of the time is spent. I use a 
> stripped down version of the xml file I've sent you. All but a few 
> ElementContainers and no sequence. Just to keep the processing time low 
> enough to debug.
>
>     use MOSES::MOBY::Parser;
>
>       open(FILE, '<', $ARGV[0]);
>       my @lines = <FILE>;
>       close FILE;
>
>       my $parser = MOSES::MOBY::Parser->new();
>       print $parser->parse( method => 'string', data => join(q{}, @lines) );
>
>
> [davebowman at hal2000 ~]$ /usr/bin/perl -d:DProf mobytest.pl mobysav_small.xml
> [davebowman at hal2000 ~]$ dprofpp
> Exporter::Heavy::heavy_export has 14 unstacked calls in outer
> Config::Simple::import_from has 1 unstacked calls in outer
> utf8::AUTOLOAD has -1 unstacked calls in outer
> Config::Simple::  verbose has 1 unstacked calls in outer
> Config::Simple::  vars has 1 unstacked calls in outer
> AutoLoader::AUTOLOAD has -3 unstacked calls in outer
> Fcntl::  AUTOLOAD has -6 unstacked calls in outer
> Fcntl: :__ANON__ has 6 unstacked calls in outer
> Exporter::export has -14 unstacked calls in outer
> utf8::SWASHNEW has 1 unstacked calls in outer
> Total Elapsed Time = 8.470220 Seconds
>  User+System Time = 7.750220 Seconds
> Exclusive Times
> %Time ExclSec CumulS #Calls sec/call Csec/c  Name
>   68.1   5.280  5.280  69736   0.0001 0.0001  IO::Scalar::print
>   6.94   0.538  0.783  64088   0.0000 0.0000  Dumpvalue::stringify
>   6.04   0.468  7.173  33755   0.0000 0.0002  Dumpvalue::unwrap
>   4.30   0.333  0.333  47091   0.0000 0.0000  File::Spec::Unix::canonpath
>   2.43   0.188  7.157  33006   0.0000 0.0002  Dumpvalue::DumpElem
>   2.13   0.165  0.165  59776   0.0000 0.0000  overload::mycan
>   1.78   0.138  0.224  15990   0.0000 0.0000  MOSES::MOBY::Base::__ANON__
>   1.42   0.110  0.708    539   0.0002 0.0013
>   MOSES::MOBY::Cache::Central::_createDataTypeFromXML
>   1.39   0.108  0.108 142023   0.0000 0.0000  UNIVERSAL::isa
>   1.29   0.100  0.100    208   0.0005 0.0005
>   MOSES::MOBY::Cache::Registries::init
>   1.20   0.093  0.357  17359   0.0000 0.0000  overload::OverloadedStringify
>   1.16   0.090  0.090  30407   0.0000 0.0000  XML::LibXML::Node::nodeName
>   1.02   0.079  0.079    539   0.0001 0.0001
>   MOSES::MOBY::Cache::Central::_clean
>   0.99   0.077  0.418    207   0.0004 0.0020
>   MOSES::MOBY::Generators::Utils::find_file
>   0.94   0.073  0.203   3147   0.0000 0.0001
>   XML::LibXML::Element::getChildrenByTagName
>
> It's fairly obivous now ;)
>
> If I remove the overload operator in MOSES::MOBY::Base, time is reduced 
> 15fold! But for the 250KB file I've sent, it still takes us a minute to 
> parse.
>
> If I simply parse the file with XML::LibXML and no callbacks or custom 
> handlers, it's parsed instantly:
>
> [davebowman at hal2000 ~]$ /usr/bin/perl -d:DProf mobytest.pl mobysav_small.xml
> [kebil at psbdev01 ~]$ dprofpp
> Total Elapsed Time = 0.079034 Seconds
>    User+System Time = 0.079034 Seconds
>
>
> -Kenny, slowly getting to the bottom of this ;)
>
>
> On Thu, 30 Oct 2008, Edward Kawas wrote:
>
>>   Now I can confirm that it is slow parsing and not an infinite loop! The
>>   XML
>>   file took over an hour on my virtual machine to parse. I will see what I
>>   can
>>   do with other SAX parsers for Perl.
>>
>>   Thanks,
>>
>>   Eddie
>>
>>   -----Original Message-----
>>   From: moby-dev-bounces at lists.open-bio.org
>>   [mailto:moby-dev-bounces at lists.open-bio.org] On Behalf Of José María
>>   Fernández González
>>   Sent: October-30-08 12:55 PM
>>   To: Core developer announcements
>>   Subject: Re: [MOBY-dev] XML parsing takes forever
>>
>>   Hi everybody,
>>   	I'm using the XML::LibXML SAX parser for other projects and tasks,
>>   and it can
>>   be really, really slow. You don't realize that until you are parsing a
>>   large
>>   file from a pipe, and you see that most of the time the program which is
>>   parsing is CPU (and not I/O) bounded.
>>
>>    Best Regards,
>>     José María
>>
>>   Edward Kawas wrote:
>> >   I think that you are right about the parser not going into an infinite
>>   loop.
>> >   For some reason, the SAX parser seems to be really slow (although, I 
>> >   don't
>> >   want to say that until I know for sure).
>> > 
>> >   Eddie
>> > 
>> >   -----Original Message-----
>> >   From: moby-dev-bounces at lists.open-bio.org
>> >   [mailto:moby-dev-bounces at lists.open-bio.org] On Behalf Of Kenny Billiau
>> >   Sent: October-30-08 11:56 AM
>> >   To: Core developer announcements
>> >   Subject: Re: [MOBY-dev] XML parsing takes forever
>> > 
>> >   Hi,
>> > 
>> >   thx for helping to investigate the problem!
>> > 
>> >   I'm pretty sure the parser gets out of it's mess. When you truncate the
>> >   XML file by removing all but a few ElementContainers, it takes awhile, 
>> >   but
>> 
>> >   it actually returns without error.
>> > 
>> >   And memory consuption doesn't seem to be a problem either, as mostly 
>> >   'top'
>> 
>> >   reveals that the invoked webservice script uses a steady amount of 
>> >   memory.
>> > 
>> >   thx,
>> >   Kenny
>> > 
>> >   On Thu, 30 Oct 2008, Edward Kawas wrote:
>> > 
>> > >   So this one might take some time... from what I can see, our parser 
>> > >   goes
>> > >   into an infinite loop (although, I am not certain about that ...).
>> > > 
>> > >   I will play around with this XML file and get back to you.
>> > > 
>> > >   Thanks,
>> > > 
>> > >   Eddie
>> > > 
>> > >   -----Original Message-----
>> > >   From: moby-dev-bounces at lists.open-bio.org
>> > >   [mailto:moby-dev-bounces at lists.open-bio.org] On Behalf Of Kenny 
>> > >   Billiau
>> > >   Sent: October-30-08 8:03 AM
>> > >   To: Core developer announcements
>> > >   Subject: Re: [MOBY-dev] XML parsing takes forever
>> > > 
>> > >   Hi,
>> > > 
>> > >   here you go,
>> > > 
>> > >   Kenny
>> > > 
>> > >   On Thu, 30 Oct 2008, Edward Kawas wrote:
>> > > 
>> > > >   Hi,
>> > > > 
>> > > >   This is the right list! Could you send me the XML for your second
>> >   service?
>> > > >   Thanks,
>> > > > 
>> > > >   Eddie
>> > >  _______________________________________________
>> > >   MOBY-dev mailing list
>> > >   MOBY-dev at lists.open-bio.org
>> > >   http://lists.open-bio.org/mailman/listinfo/moby-dev
>> > > 
>> >
>> 
>> 
>
>

-- 
-------------- next part --------------
--- /usr/lib/perl5/site_perl/5.8.5/MOSES/MOBY/Generators/Utils.pm	2008-04-29 21:45:41.000000000 +0200
+++ ../modules/MOSES/MOBY/Generators/Utils.pm	2008-11-05 16:47:23.000000000 +0100
@@ -26,7 +26,7 @@
  # find a file located somewhere in @INC
  use MOSES::MOBY::Generators::Utils;
  my $file = MOSES::MOBY::Generators::Utils->find_file ('resource.file');
- 
+
 =cut
 
 =head1 DESCRIPTION
@@ -61,17 +61,29 @@
 
 =cut
 
+my %full_path_of = ();
+
 sub find_file {
     my ($self, $default_start, @names) = @_;
     my $fixed_part = File::Spec->catfile (@names);
+    return $full_path_of{ $fixed_part } if exists $full_path_of{ $fixed_part };
+
     my $result = File::Spec->catfile ($default_start, $fixed_part);
-    return $result if -e $result;
+    if (-e $result) {
+        $full_path_of{ $fixed_part } = $result;
+        return $result;
+    }
 
     foreach my $idx (0 .. $#INC) {
-	$result = File::Spec->catfile ($INC[$idx], $fixed_part);
-	return $result if -e $result;
+        $result = File::Spec->catfile ($INC[$idx], $fixed_part);
+        if (-e $result) {
+            $full_path_of{ $fixed_part } = $result;
+            return $result;
+        }
     }
-    return File::Spec->catfile ($default_start, $fixed_part);
+    $result = File::Spec->catfile ($default_start, $fixed_part);
+    $full_path_of{ $fixed_part } = $result;
+    return $result;
 }
 
 1;


More information about the MOBY-dev mailing list