[MOBY-dev] XML parsing takes forever
Edward Kawas
edward.kawas at gmail.com
Thu Nov 6 13:37:55 EST 2008
Hi Kevin,
I applied your patch to Utils.pm. I also modified
MOSES::MOBY::Cache::Registries so that a file read isnt done every time
that module is instantiated. Finally, I moved IO::Scalar to IO::String.
Now, instead of 1 hour+ to parse the XML file, we have got it down to just
over a minute. I am hesitant to remove the as_string overload in Base.pm. We
use it for displaying our datatypes nicely (better than if Data::Dumper was
used). If we can find a way to produce the same information as as_string and
be more efficient, I would swap that in right away.
The changes are in the CVS @ moby-live/Perl/MOSES-MOBY .
Please keep hacking away! I really appreciate all of your help so far!
Thanks.
Eddie
-----Original Message-----
From: moby-dev-bounces at lists.open-bio.org
[mailto:moby-dev-bounces at lists.open-bio.org] On Behalf Of Kenny Billiau
Sent: November-05-08 7:57 AM
To: Core developer announcements
Subject: Re: [MOBY-dev] XML parsing takes forever
Hi,
after some more tests, I've succesfully reduced the parsing time from over
1 hour 15 minutes to 17 seconds. Which is not ideal yet, but at least
workable :)
This is the profiling result for the original environment. The test xml file
is the original file I've sent here without the sequence.
[davebowman at hal2000 ]$ /usr/bin/perl -d:DProf mobytest.pl mobysav_no_dna.xml
Trying to parse.. Done.
[davebowman at hal2000 ]$ dprofpp
Total Elapsed Time = 4348.764 Seconds
User+System Time = 4344.374 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
97.3 4227. 4227.2 399336 0.0011 0.0011 IO::Scalar::print
0.88 38.06 4357.4 189130 0.0000 0.0023 Dumpvalue::unwrap
0.80 34.54 51.532 369048 0.0000 0.0000 Dumpvalue::stringify
0.54 23.63 23.634 463568 0.0000 0.0000 File::Spec::Unix::canonpath
0.33 14.23 4356.7 188364 0.0000 0.0023 Dumpvalue::DumpElem
First I simply removed the overload operator "" (as_string). I still get the
same results and processing time is reduced drastically.
I don't really know why this way of stringifying (complex) perl types is
used here.
Is there a special reason for it? Can't we simply use Data::Dumper (which
prints to a $var by default)?
[davebowman at hal2000 ]$ dprofpp
Total Elapsed Time = 35.85114 Seconds
User+System Time = 33.54114 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
64.5 21.64 21.646 463568 0.0000 0.0000 File::Spec::Unix::canonpath
22.9 7.698 31.602 231569 0.0000 0.0000 File::Spec::Unix::catfile
19.7 6.607 6.607 2144 0.0031 0.0031
MOSES::MOBY::Cache::Registries::init
18.6 6.251 37.772 2143 0.0029 0.0176
MOSES::MOBY::Generators::Utils::find_file
6.80 2.282 13.448 231569 0.0000 0.0000 File::Spec::Unix::catdir
A lot of calls are made to MOSES::MOBY::Generators::Utils::find_file,
which in it's turn does a lot of calls with File::Spec. So I simply cache
the results.
Time is reduced to 17 seconds:
Total Elapsed Time = 17.34032 Seconds
User+System Time = 13.20032 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
51.5 6.807 6.807 2144 0.0032 0.0032
MOSES::MOBY::Cache::Registries::init
9.01 1.190 2.005 162858 0.0000 0.0000 MOSES::MOBY::Base::__ANON__
4.65 0.614 0.614 310751 0.0000 0.0000 XML::LibXML::Node::nodeName
4.64 0.612 1.655 32171 0.0000 0.0001
XML::LibXML::Element::getChildrenByTagName
4.48 0.592 0.592 5515 0.0001 0.0001
MOSES::MOBY::Cache::Central::_clean
More then half of the time still goes to building up the MOBY environment
for each processed element in the XML file. Isn't this a bit overkill? ;) Is
there a way to make a MOSES::MOBY factory that can handle the building up of
the moby environment just once in the same (mod_perl) call? Or is there a
reason this has been avoided?
I've included the patches for the result caching in
MOSES::MOBY::Generators::Utils::find_file . there's not much to it. My
version of MOSES::MOBY still is the one uploaded to cpan: Module MOSES::MOBY
(E/EK/EKAWAS/MOSES-MOBY-0.86.tar.gz)
happy hacking,
Kenny
On Tue, 4 Nov 2008, Kenny Billiau wrote:
> Hi,
>
> I've run a little testprogram to see where most of the time is spent.
> I use a stripped down version of the xml file I've sent you. All but a
> few ElementContainers and no sequence. Just to keep the processing
> time low enough to debug.
>
> use MOSES::MOBY::Parser;
>
> open(FILE, '<', $ARGV[0]);
> my @lines = <FILE>;
> close FILE;
>
> my $parser = MOSES::MOBY::Parser->new();
> print $parser->parse( method => 'string', data => join(q{},
> @lines) );
>
>
> [davebowman at hal2000 ~]$ /usr/bin/perl -d:DProf mobytest.pl
> mobysav_small.xml [davebowman at hal2000 ~]$ dprofpp
> Exporter::Heavy::heavy_export has 14 unstacked calls in outer
> Config::Simple::import_from has 1 unstacked calls in outer
> utf8::AUTOLOAD has -1 unstacked calls in outer
> Config::Simple:: verbose has 1 unstacked calls in outer
> Config::Simple:: vars has 1 unstacked calls in outer
> AutoLoader::AUTOLOAD has -3 unstacked calls in outer
> Fcntl:: AUTOLOAD has -6 unstacked calls in outer
> Fcntl: :__ANON__ has 6 unstacked calls in outer Exporter::export has
> -14 unstacked calls in outer utf8::SWASHNEW has 1 unstacked calls in
> outer Total Elapsed Time = 8.470220 Seconds User+System Time =
> 7.750220 Seconds Exclusive Times %Time ExclSec CumulS #Calls sec/call
> Csec/c Name
> 68.1 5.280 5.280 69736 0.0001 0.0001 IO::Scalar::print
> 6.94 0.538 0.783 64088 0.0000 0.0000 Dumpvalue::stringify
> 6.04 0.468 7.173 33755 0.0000 0.0002 Dumpvalue::unwrap
> 4.30 0.333 0.333 47091 0.0000 0.0000 File::Spec::Unix::canonpath
> 2.43 0.188 7.157 33006 0.0000 0.0002 Dumpvalue::DumpElem
> 2.13 0.165 0.165 59776 0.0000 0.0000 overload::mycan
> 1.78 0.138 0.224 15990 0.0000 0.0000 MOSES::MOBY::Base::__ANON__
> 1.42 0.110 0.708 539 0.0002 0.0013
> MOSES::MOBY::Cache::Central::_createDataTypeFromXML
> 1.39 0.108 0.108 142023 0.0000 0.0000 UNIVERSAL::isa
> 1.29 0.100 0.100 208 0.0005 0.0005
> MOSES::MOBY::Cache::Registries::init
> 1.20 0.093 0.357 17359 0.0000 0.0000
overload::OverloadedStringify
> 1.16 0.090 0.090 30407 0.0000 0.0000 XML::LibXML::Node::nodeName
> 1.02 0.079 0.079 539 0.0001 0.0001
> MOSES::MOBY::Cache::Central::_clean
> 0.99 0.077 0.418 207 0.0004 0.0020
> MOSES::MOBY::Generators::Utils::find_file
> 0.94 0.073 0.203 3147 0.0000 0.0001
> XML::LibXML::Element::getChildrenByTagName
>
> It's fairly obivous now ;)
>
> If I remove the overload operator in MOSES::MOBY::Base, time is
> reduced 15fold! But for the 250KB file I've sent, it still takes us a
> minute to parse.
>
> If I simply parse the file with XML::LibXML and no callbacks or custom
> handlers, it's parsed instantly:
>
> [davebowman at hal2000 ~]$ /usr/bin/perl -d:DProf mobytest.pl
> mobysav_small.xml
> [kebil at psbdev01 ~]$ dprofpp
> Total Elapsed Time = 0.079034 Seconds
> User+System Time = 0.079034 Seconds
>
>
> -Kenny, slowly getting to the bottom of this ;)
>
>
> On Thu, 30 Oct 2008, Edward Kawas wrote:
>
>> Now I can confirm that it is slow parsing and not an infinite loop! The
>> XML
>> file took over an hour on my virtual machine to parse. I will see what
I
>> can
>> do with other SAX parsers for Perl.
>>
>> Thanks,
>>
>> Eddie
>>
>> -----Original Message-----
>> From: moby-dev-bounces at lists.open-bio.org
>> [mailto:moby-dev-bounces at lists.open-bio.org] On Behalf Of José María
>> Fernández González
>> Sent: October-30-08 12:55 PM
>> To: Core developer announcements
>> Subject: Re: [MOBY-dev] XML parsing takes forever
>>
>> Hi everybody,
>> I'm using the XML::LibXML SAX parser for other projects and tasks,
>> and it can
>> be really, really slow. You don't realize that until you are parsing a
>> large
>> file from a pipe, and you see that most of the time the program which
is
>> parsing is CPU (and not I/O) bounded.
>>
>> Best Regards,
>> José María
>>
>> Edward Kawas wrote:
>> > I think that you are right about the parser not going into an
>> > infinite
>> loop.
>> > For some reason, the SAX parser seems to be really slow (although, I
>> > don't
>> > want to say that until I know for sure).
>> >
>> > Eddie
>> >
>> > -----Original Message-----
>> > From: moby-dev-bounces at lists.open-bio.org
>> > [mailto:moby-dev-bounces at lists.open-bio.org] On Behalf Of Kenny
Billiau
>> > Sent: October-30-08 11:56 AM
>> > To: Core developer announcements
>> > Subject: Re: [MOBY-dev] XML parsing takes forever
>> >
>> > Hi,
>> >
>> > thx for helping to investigate the problem!
>> >
>> > I'm pretty sure the parser gets out of it's mess. When you truncate
the
>> > XML file by removing all but a few ElementContainers, it takes
awhile,
>> > but
>>
>> > it actually returns without error.
>> >
>> > And memory consuption doesn't seem to be a problem either, as mostly
>> > 'top'
>>
>> > reveals that the invoked webservice script uses a steady amount of
>> > memory.
>> >
>> > thx,
>> > Kenny
>> >
>> > On Thu, 30 Oct 2008, Edward Kawas wrote:
>> >
>> > > So this one might take some time... from what I can see, our parser
>> > > goes
>> > > into an infinite loop (although, I am not certain about that ...).
>> > >
>> > > I will play around with this XML file and get back to you.
>> > >
>> > > Thanks,
>> > >
>> > > Eddie
>> > >
>> > > -----Original Message-----
>> > > From: moby-dev-bounces at lists.open-bio.org
>> > > [mailto:moby-dev-bounces at lists.open-bio.org] On Behalf Of Kenny
>> > > Billiau
>> > > Sent: October-30-08 8:03 AM
>> > > To: Core developer announcements
>> > > Subject: Re: [MOBY-dev] XML parsing takes forever
>> > >
>> > > Hi,
>> > >
>> > > here you go,
>> > >
>> > > Kenny
>> > >
>> > > On Thu, 30 Oct 2008, Edward Kawas wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > This is the right list! Could you send me the XML for your
>> > > > second
>> > service?
>> > > > Thanks,
>> > > >
>> > > > Eddie
>> > > _______________________________________________
>> > > MOBY-dev mailing list
>> > > MOBY-dev at lists.open-bio.org
>> > > http://lists.open-bio.org/mailman/listinfo/moby-dev
>> > >
>> >
>>
>>
>
>
--
More information about the MOBY-dev
mailing list