[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Peter Cock p.j.a.cock at googlemail.com
Tue Sep 25 09:32:07 EDT 2012


On Tue, Sep 25, 2012 at 1:26 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> OK - so there is at least one person making heaving use of the
> NCBIXML so we shouldn't rush to deprecate it after merging
> SearchIO, and there *is* some benefit from making it faster
> (but with the same API).
>
> In principle NCBIXML would be rewritten to use cElementTree
> /ElementTree and preserve the API - if you or anyone else want
> to do that (and the unit tests still pass), then I'm happy to review
> such changes. Likewise for less dramatic optimisations.

Martin emailed me to ask about this bit of the code, and it
can be sped up - this shows about a 5% reduction:
https://github.com/biopython/biopython/commit/970364761982bf331c221b6f007e8b8e52fa9600

Summary parsing a 286MB XML file from BLASTX 2.2.26+
for 1000 genes against the NR database.

NCBIXML before change: About 162s
NCBIXML after change: About 154s
NCBIXML removing debug: About 152s
Using SearchIO: About 79s

This is probably the same test file Bow gave numbers for earlier,
although it seems SearchIO has less of an advantage on my
machine (about x2) compared to Bow's machine (almost x5).

(We should check memory usage too...)

Peter

---------------------------------------------

The full details,

Before this change:

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 161.8s

real	2m41.894s
user	2m41.208s
sys	0m0.675s

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 161.8s

real	2m41.984s
user	2m41.296s
sys	0m0.677s

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 162.6s

real	2m42.771s
user	2m41.995s
sys	0m0.763s


With this change:

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 152.4s

real	2m32.582s
user	2m31.910s
sys	0m0.663s

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 153.5s

real	2m33.680s
user	2m32.977s
sys	0m0.695s

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 153.8s

real	2m33.931s
user	2m33.258s
sys	0m0.661s

And if we go further and remove _debug_ignore_list and
this bit of debug code the saving is marginal:

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 151.5s

real	2m31.611s
user	2m30.934s
sys	0m0.665s

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 151.2s

real	2m31.348s
user	2m30.664s
sys	0m0.674s

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 152.9s

real	2m32.994s
user	2m32.314s
sys	0m0.669s

This is the timing script I used,

$ more /tmp/time_ncbixml.py
import sys
import time
from Bio.Blast import NCBIXML
for f in sys.argv[1:]:
    start = time.time()
    count = 0
    handle = open(f)
    for record in NCBIXML.parse(handle):
        count += 1
    handle.close()
    print "%i records in %s in %0.1fs" % (count, f, time.time() - start)
#End of file

For comparison, here is the timing on the same setup but using
SearchIO from Bow's current branch:

$ time python time_searchio.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 79.1s

real	1m19.259s
user	1m18.397s
sys	0m0.799s

$ time python time_searchio.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 78.7s

real	1m18.878s
user	1m18.149s
sys	0m0.719s

$ time python time_searchio.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 79.5s

real	1m19.611s
user	1m18.683s
sys	0m0.918s

And the script:

$ more /tmp/time_searchio.py
import sys
import time
from Bio import SearchIO
for f in sys.argv[1:]:
    start = time.time()
    count = 0
    handle = open(f)
    for record in SearchIO.parse(handle, "blast-xml"):
        count += 1
    handle.close()
    print "%i records in %s in %0.1fs" % (count, f, time.time() - start)
#End of file


More information about the Biopython mailing list