[MOBY-dev] XML parsing takes forever

Edward Kawas edward.kawas at gmail.com
Thu Nov 6 13:37:55 EST 2008


Hi Kevin,

I applied your patch to Utils.pm. I also modified
MOSES::MOBY::Cache::Registries so that a file read isn’t done every time
that module is instantiated. Finally, I moved IO::Scalar to IO::String.

Now, instead of 1 hour+ to parse the XML file, we have got it down to just
over a minute. I am hesitant to remove the as_string overload in Base.pm. We
use it for displaying our datatypes nicely (better than if Data::Dumper was
used). If we can find a way to produce the same information as as_string and
be more efficient, I would swap that in right away.

The changes are in the CVS @ moby-live/Perl/MOSES-MOBY .

Please keep hacking away! I really appreciate all of your help so far!

Thanks.

Eddie



-----Original Message-----
From: moby-dev-bounces at lists.open-bio.org
[mailto:moby-dev-bounces at lists.open-bio.org] On Behalf Of Kenny Billiau
Sent: November-05-08 7:57 AM
To: Core developer announcements
Subject: Re: [MOBY-dev] XML parsing takes forever

Hi,

after some more tests, I've succesfully reduced the parsing time from over
1 hour 15 minutes to 17 seconds. Which is not ideal yet, but at least
workable :)

This is the profiling result for the original environment. The test xml file
is the original file I've sent here without the sequence.

[davebowman at hal2000 ]$ /usr/bin/perl -d:DProf mobytest.pl mobysav_no_dna.xml
Trying to parse.. Done.
[davebowman at hal2000 ]$ dprofpp
Total Elapsed Time = 4348.764 Seconds
   User+System Time = 4344.374 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
  97.3   4227. 4227.2 399336   0.0011 0.0011  IO::Scalar::print
  0.88   38.06 4357.4 189130   0.0000 0.0023  Dumpvalue::unwrap
  0.80   34.54 51.532 369048   0.0000 0.0000  Dumpvalue::stringify
  0.54   23.63 23.634 463568   0.0000 0.0000  File::Spec::Unix::canonpath
  0.33   14.23 4356.7 188364   0.0000 0.0023  Dumpvalue::DumpElem


First I simply removed the overload operator "" (as_string). I still get the
same results and processing time is reduced drastically.
I don't really know why this way of stringifying (complex) perl types is
used here. 
Is there a special reason for it? Can't we simply use Data::Dumper (which
prints to a $var by default)?

[davebowman at hal2000 ]$ dprofpp
Total Elapsed Time = 35.85114 Seconds
   User+System Time = 33.54114 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
  64.5   21.64 21.646 463568   0.0000 0.0000  File::Spec::Unix::canonpath
  22.9   7.698 31.602 231569   0.0000 0.0000  File::Spec::Unix::catfile
  19.7   6.607  6.607   2144   0.0031 0.0031
MOSES::MOBY::Cache::Registries::init
  18.6   6.251 37.772   2143   0.0029 0.0176
MOSES::MOBY::Generators::Utils::find_file
  6.80   2.282 13.448 231569   0.0000 0.0000  File::Spec::Unix::catdir

A lot of calls are made to MOSES::MOBY::Generators::Utils::find_file,
which in it's turn does a lot of calls with File::Spec. So I simply cache
the results.
Time is reduced to 17 seconds:

Total Elapsed Time = 17.34032 Seconds
   User+System Time = 13.20032 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
  51.5   6.807  6.807   2144   0.0032 0.0032
MOSES::MOBY::Cache::Registries::init
  9.01   1.190  2.005 162858   0.0000 0.0000  MOSES::MOBY::Base::__ANON__
  4.65   0.614  0.614 310751   0.0000 0.0000  XML::LibXML::Node::nodeName
  4.64   0.612  1.655  32171   0.0000 0.0001
XML::LibXML::Element::getChildrenByTagName
  4.48   0.592  0.592   5515   0.0001 0.0001
MOSES::MOBY::Cache::Central::_clean

More then half of the time still goes to building up the MOBY environment
for each processed element in the XML file. Isn't this a bit overkill? ;) Is
there a way to make a MOSES::MOBY factory that can handle the building up of
the moby environment just once in the same (mod_perl) call? Or is there a
reason this has been avoided?

I've included the patches for the result caching in
MOSES::MOBY::Generators::Utils::find_file . there's not much to it. My
version of MOSES::MOBY still is the one uploaded to cpan: Module MOSES::MOBY
(E/EK/EKAWAS/MOSES-MOBY-0.86.tar.gz)

happy hacking,
Kenny


On Tue, 4 Nov 2008, Kenny Billiau wrote:

> Hi,
>
> I've run a little testprogram to see where most of the time is spent. 
> I use a stripped down version of the xml file I've sent you. All but a 
> few ElementContainers and no sequence. Just to keep the processing 
> time low enough to debug.
>
>     use MOSES::MOBY::Parser;
>
>       open(FILE, '<', $ARGV[0]);
>       my @lines = <FILE>;
>       close FILE;
>
>       my $parser = MOSES::MOBY::Parser->new();
>       print $parser->parse( method => 'string', data => join(q{}, 
> @lines) );
>
>
> [davebowman at hal2000 ~]$ /usr/bin/perl -d:DProf mobytest.pl 
> mobysav_small.xml [davebowman at hal2000 ~]$ dprofpp 
> Exporter::Heavy::heavy_export has 14 unstacked calls in outer 
> Config::Simple::import_from has 1 unstacked calls in outer 
> utf8::AUTOLOAD has -1 unstacked calls in outer
> Config::Simple::  verbose has 1 unstacked calls in outer
> Config::Simple::  vars has 1 unstacked calls in outer 
> AutoLoader::AUTOLOAD has -3 unstacked calls in outer
> Fcntl::  AUTOLOAD has -6 unstacked calls in outer
> Fcntl: :__ANON__ has 6 unstacked calls in outer Exporter::export has 
> -14 unstacked calls in outer utf8::SWASHNEW has 1 unstacked calls in 
> outer Total Elapsed Time = 8.470220 Seconds  User+System Time = 
> 7.750220 Seconds Exclusive Times %Time ExclSec CumulS #Calls sec/call 
> Csec/c  Name
>   68.1   5.280  5.280  69736   0.0001 0.0001  IO::Scalar::print
>   6.94   0.538  0.783  64088   0.0000 0.0000  Dumpvalue::stringify
>   6.04   0.468  7.173  33755   0.0000 0.0002  Dumpvalue::unwrap
>   4.30   0.333  0.333  47091   0.0000 0.0000  File::Spec::Unix::canonpath
>   2.43   0.188  7.157  33006   0.0000 0.0002  Dumpvalue::DumpElem
>   2.13   0.165  0.165  59776   0.0000 0.0000  overload::mycan
>   1.78   0.138  0.224  15990   0.0000 0.0000  MOSES::MOBY::Base::__ANON__
>   1.42   0.110  0.708    539   0.0002 0.0013
>   MOSES::MOBY::Cache::Central::_createDataTypeFromXML
>   1.39   0.108  0.108 142023   0.0000 0.0000  UNIVERSAL::isa
>   1.29   0.100  0.100    208   0.0005 0.0005
>   MOSES::MOBY::Cache::Registries::init
>   1.20   0.093  0.357  17359   0.0000 0.0000
overload::OverloadedStringify
>   1.16   0.090  0.090  30407   0.0000 0.0000  XML::LibXML::Node::nodeName
>   1.02   0.079  0.079    539   0.0001 0.0001
>   MOSES::MOBY::Cache::Central::_clean
>   0.99   0.077  0.418    207   0.0004 0.0020
>   MOSES::MOBY::Generators::Utils::find_file
>   0.94   0.073  0.203   3147   0.0000 0.0001
>   XML::LibXML::Element::getChildrenByTagName
>
> It's fairly obivous now ;)
>
> If I remove the overload operator in MOSES::MOBY::Base, time is 
> reduced 15fold! But for the 250KB file I've sent, it still takes us a 
> minute to parse.
>
> If I simply parse the file with XML::LibXML and no callbacks or custom 
> handlers, it's parsed instantly:
>
> [davebowman at hal2000 ~]$ /usr/bin/perl -d:DProf mobytest.pl 
> mobysav_small.xml
> [kebil at psbdev01 ~]$ dprofpp
> Total Elapsed Time = 0.079034 Seconds
>    User+System Time = 0.079034 Seconds
>
>
> -Kenny, slowly getting to the bottom of this ;)
>
>
> On Thu, 30 Oct 2008, Edward Kawas wrote:
>
>>   Now I can confirm that it is slow parsing and not an infinite loop! The
>>   XML
>>   file took over an hour on my virtual machine to parse. I will see what
I
>>   can
>>   do with other SAX parsers for Perl.
>>
>>   Thanks,
>>
>>   Eddie
>>
>>   -----Original Message-----
>>   From: moby-dev-bounces at lists.open-bio.org
>>   [mailto:moby-dev-bounces at lists.open-bio.org] On Behalf Of José María
>>   Fernández González
>>   Sent: October-30-08 12:55 PM
>>   To: Core developer announcements
>>   Subject: Re: [MOBY-dev] XML parsing takes forever
>>
>>   Hi everybody,
>>   	I'm using the XML::LibXML SAX parser for other projects and tasks,
>>   and it can
>>   be really, really slow. You don't realize that until you are parsing a
>>   large
>>   file from a pipe, and you see that most of the time the program which
is
>>   parsing is CPU (and not I/O) bounded.
>>
>>    Best Regards,
>>     José María
>>
>>   Edward Kawas wrote:
>> >   I think that you are right about the parser not going into an 
>> > infinite
>>   loop.
>> >   For some reason, the SAX parser seems to be really slow (although, I 
>> >   don't
>> >   want to say that until I know for sure).
>> > 
>> >   Eddie
>> > 
>> >   -----Original Message-----
>> >   From: moby-dev-bounces at lists.open-bio.org
>> >   [mailto:moby-dev-bounces at lists.open-bio.org] On Behalf Of Kenny
Billiau
>> >   Sent: October-30-08 11:56 AM
>> >   To: Core developer announcements
>> >   Subject: Re: [MOBY-dev] XML parsing takes forever
>> > 
>> >   Hi,
>> > 
>> >   thx for helping to investigate the problem!
>> > 
>> >   I'm pretty sure the parser gets out of it's mess. When you truncate
the
>> >   XML file by removing all but a few ElementContainers, it takes
awhile, 
>> >   but
>> 
>> >   it actually returns without error.
>> > 
>> >   And memory consuption doesn't seem to be a problem either, as mostly 
>> >   'top'
>> 
>> >   reveals that the invoked webservice script uses a steady amount of 
>> >   memory.
>> > 
>> >   thx,
>> >   Kenny
>> > 
>> >   On Thu, 30 Oct 2008, Edward Kawas wrote:
>> > 
>> > >   So this one might take some time... from what I can see, our parser

>> > >   goes
>> > >   into an infinite loop (although, I am not certain about that ...).
>> > > 
>> > >   I will play around with this XML file and get back to you.
>> > > 
>> > >   Thanks,
>> > > 
>> > >   Eddie
>> > > 
>> > >   -----Original Message-----
>> > >   From: moby-dev-bounces at lists.open-bio.org
>> > >   [mailto:moby-dev-bounces at lists.open-bio.org] On Behalf Of Kenny 
>> > >   Billiau
>> > >   Sent: October-30-08 8:03 AM
>> > >   To: Core developer announcements
>> > >   Subject: Re: [MOBY-dev] XML parsing takes forever
>> > > 
>> > >   Hi,
>> > > 
>> > >   here you go,
>> > > 
>> > >   Kenny
>> > > 
>> > >   On Thu, 30 Oct 2008, Edward Kawas wrote:
>> > > 
>> > > >   Hi,
>> > > > 
>> > > >   This is the right list! Could you send me the XML for your 
>> > > > second
>> >   service?
>> > > >   Thanks,
>> > > > 
>> > > >   Eddie
>> > >  _______________________________________________
>> > >   MOBY-dev mailing list
>> > >   MOBY-dev at lists.open-bio.org
>> > >   http://lists.open-bio.org/mailman/listinfo/moby-dev
>> > > 
>> >
>> 
>> 
>
>

-- 




More information about the MOBY-dev mailing list