[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
Tanya Golubchik
golubchi at stats.ox.ac.uk
Tue Sep 25 14:39:11 UTC 2012
Hello,
Apologies for not having followed the entire discussion, but just wanted
to say that we're also using NCBIXML here and are likely to be
incorporating it in a new piece of software soon, so it would be really
unfortunate if some tags disappeared, were renamed or (even worse)
changed meaning in future releases.
I'm a bit late coming in here so maybe this has been answered, but is
there a better parser that should be used at the moment? I was under the
impression that NCBIXML is the only one.
Thanks,
Tanya
On 25/09/12 14:32, Peter Cock wrote:
> On Tue, Sep 25, 2012 at 1:26 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>
>> OK - so there is at least one person making heavy use of
>> NCBIXML so we shouldn't rush to deprecate it after merging
>> SearchIO, and there *is* some benefit from making it faster
>> (but with the same API).
>>
>> In principle NCBIXML could be rewritten to use
>> cElementTree/ElementTree and preserve the API - if you or anyone
>> else wants to do that (and the unit tests still pass), then I'm
>> happy to review such changes. Likewise for less dramatic optimisations.
>
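As a rough illustration of the cElementTree/ElementTree idea mentioned above, here is a minimal sketch that streams over BLAST XML with the standard library's iterparse and counts the Iteration elements (one per query). It is written for Python 3, it is not how NCBIXML or SearchIO is implemented, and the name count_iterations and the tiny SAMPLE document are made up for this example. A real rewrite would also have to build the existing record objects to preserve the API.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for real BLAST XML output (illustrative only).
SAMPLE = b"""<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration><Iteration_iter-num>1</Iteration_iter-num></Iteration>
    <Iteration><Iteration_iter-num>2</Iteration_iter-num></Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def count_iterations(handle):
    """Count <Iteration> elements without loading the whole file at once."""
    count = 0
    # "end" events fire once an element and all its children are parsed.
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration":
            count += 1
            # Drop the children of the finished Iteration so they can be freed.
            elem.clear()
    return count


if __name__ == "__main__":
    print(count_iterations(io.BytesIO(SAMPLE)))
```

For a file on disk, the same function could be given a handle opened in binary mode, for example open("thousand_blastx_nr.xml", "rb").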
> Martin emailed me to ask about this bit of the code, and it
> can be sped up - this shows about a 5% reduction:
> https://github.com/biopython/biopython/commit/970364761982bf331c221b6f007e8b8e52fa9600
>
> Summary of parsing a 286MB XML file from BLASTX 2.2.26+
> (1000 genes against the NR database):
>
> NCBIXML before change:  About 162s
> NCBIXML after change:   About 154s
> NCBIXML removing debug: About 152s
> Using SearchIO:         About 79s
>
> This is probably the same test file Bow gave numbers for earlier,
> although it seems SearchIO has less of an advantage on my
> machine (about x2) compared to Bow's machine (almost x5).
>
> (We should check memory usage too...)
>
> Peter
>
> ---------------------------------------------
>
> The full details,
>
> Before this change:
>
> $ time python time_ncbixml.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 161.8s
>
> real 2m41.894s
> user 2m41.208s
> sys 0m0.675s
>
> $ time python time_ncbixml.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 161.8s
>
> real 2m41.984s
> user 2m41.296s
> sys 0m0.677s
>
> $ time python time_ncbixml.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 162.6s
>
> real 2m42.771s
> user 2m41.995s
> sys 0m0.763s
>
>
> With this change:
>
> $ time python time_ncbixml.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 152.4s
>
> real 2m32.582s
> user 2m31.910s
> sys 0m0.663s
>
> $ time python time_ncbixml.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 153.5s
>
> real 2m33.680s
> user 2m32.977s
> sys 0m0.695s
>
> $ time python time_ncbixml.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 153.8s
>
> real 2m33.931s
> user 2m33.258s
> sys 0m0.661s
>
> And if we go further and remove _debug_ignore_list and
> this bit of debug code the saving is marginal:
>
> $ time python time_ncbixml.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 151.5s
>
> real 2m31.611s
> user 2m30.934s
> sys 0m0.665s
>
> $ time python time_ncbixml.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 151.2s
>
> real 2m31.348s
> user 2m30.664s
> sys 0m0.674s
>
> $ time python time_ncbixml.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 152.9s
>
> real 2m32.994s
> user 2m32.314s
> sys 0m0.669s
>
> This is the timing script I used,
>
> $ more /tmp/time_ncbixml.py
> import sys
> import time
> from Bio.Blast import NCBIXML
> for f in sys.argv[1:]:
>     start = time.time()
>     count = 0
>     handle = open(f)
>     for record in NCBIXML.parse(handle):
>         count += 1
>     handle.close()
>     print "%i records in %s in %0.1fs" % (count, f, time.time() - start)
> #End of file
>
> For comparison, here is the timing on the same setup but using
> SearchIO from Bow's current branch:
>
> $ time python time_searchio.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 79.1s
>
> real 1m19.259s
> user 1m18.397s
> sys 0m0.799s
>
> $ time python time_searchio.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 78.7s
>
> real 1m18.878s
> user 1m18.149s
> sys 0m0.719s
>
> $ time python time_searchio.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 79.5s
>
> real 1m19.611s
> user 1m18.683s
> sys 0m0.918s
>
> And the script:
>
> $ more /tmp/time_searchio.py
> import sys
> import time
> from Bio import SearchIO
> for f in sys.argv[1:]:
>     start = time.time()
>     count = 0
>     handle = open(f)
>     for record in SearchIO.parse(handle, "blast-xml"):
>         count += 1
>     handle.close()
>     print "%i records in %s in %0.1fs" % (count, f, time.time() - start)
> #End of file
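The two timing scripts measure only elapsed time. For the memory question raised earlier in the message, one possible approach is sketched below using Python 3's tracemalloc module. This is an assumption about how one might check it, not something that was run for the numbers quoted here, and the helper name count_with_peak_memory is hypothetical. Note that tracemalloc only reports memory allocated through Python's own allocators, so memory used inside C libraries may not be counted, and tracing slows the run, so it should not be combined with the timings.

```python
import tracemalloc


def count_with_peak_memory(records):
    """Consume an iterable of records; return (count, peak_bytes)."""
    tracemalloc.start()
    try:
        count = 0
        for _record in records:
            count += 1
        # get_traced_memory() returns (current, peak) in bytes since start().
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return count, peak


# Hypothetical usage with the parsers timed above (needs Biopython installed):
#   with open("thousand_blastx_nr.xml") as handle:
#       print(count_with_peak_memory(NCBIXML.parse(handle)))
```

Because both NCBIXML.parse and SearchIO.parse return iterators, the same helper could be used for either parser, which would keep the comparison like for like.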
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython