[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Tue Sep 25 14:39:11 UTC 2012

Hello,

Apologies for not having followed the entire discussion, but I just wanted
to say that we're also using NCBIXML here and are likely to be
incorporating it in a new piece of software soon, so it would be really
unfortunate if some tags disappeared, were renamed or (even worse)
changed meaning in future releases.

I'm a bit late coming in here so maybe this has been answered, but is
there a better parser that should be used at the moment? I was under the
impression that NCBIXML is the only one.

Thanks,
Tanya

On 25/09/12 14:32, Peter Cock wrote:
> On Tue, Sep 25, 2012 at 1:26 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>
>> OK - so there is at least one person making heavy use of
>> NCBIXML so we shouldn't rush to deprecate it after merging
>> SearchIO, and there *is* some benefit from making it faster
>> (but with the same API).
>>
>> In principle NCBIXML could be rewritten to use cElementTree
>> /ElementTree and preserve the API - if you or anyone else wants
>> to do that (and the unit tests still pass), then I'm happy to review
>> such changes. Likewise for less dramatic optimisations.
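
For anyone weighing up the ElementTree route mentioned above, here is a
minimal sketch of streaming a BLAST XML file with
xml.etree.ElementTree.iterparse. It is only an illustration, not how
NCBIXML or SearchIO is implemented: the function name
iter_query_hit_counts and the tiny SAMPLE document are made up for this
example, and it only reads the Iteration, Iteration_query-def and Hit
elements rather than building full record objects.

```python
import io
import xml.etree.ElementTree as ET

# Tiny hand-written stand-in for a BLAST XML file (illustrative only).
SAMPLE = b"""<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>gene1</Iteration_query-def>
      <Iteration_hits>
        <Hit><Hit_id>a</Hit_id></Hit>
        <Hit><Hit_id>b</Hit_id></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>gene2</Iteration_query-def>
      <Iteration_hits></Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def iter_query_hit_counts(handle):
    """Yield (query description, number of hits) for each <Iteration>.

    Hypothetical helper for illustration; it is not part of NCBIXML.
    """
    # "end" events fire once an element and all its children are complete.
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration":
            query = elem.findtext("Iteration_query-def")
            hits = len(elem.findall("Iteration_hits/Hit"))
            yield query, hits
            # Drop this iteration's children so memory stays bounded;
            # an empty <Iteration> shell stays attached to its parent.
            elem.clear()


if __name__ == "__main__":
    for query, hits in iter_query_hit_counts(io.BytesIO(SAMPLE)):
        print("%s: %i hits" % (query, hits))
```

Clearing each Iteration once it has been reported is what keeps memory
roughly flat on a large file. A real rewrite would still have to fill in
the same record attributes NCBIXML exposes today so that the API is
preserved.
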
> 
> Martin emailed me to ask about this bit of the code, and it
> can be sped up - this shows about a 5% reduction:
> https://github.com/biopython/biopython/commit/970364761982bf331c221b6f007e8b8e52fa9600
> 
> Summary of parsing a 286MB XML file from BLASTX 2.2.26+
> for 1000 genes against the NR database:
> 
> NCBIXML before change: About 162s
> NCBIXML after change: About 154s
> NCBIXML removing debug: About 152s
> Using SearchIO: About 79s
> 
> This is probably the same test file Bow gave numbers for earlier,
> although it seems SearchIO has less of an advantage on my
> machine (about x2) compared to Bow's machine (almost x5).
> 
> (We should check memory usage too...)
> 
> Peter
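
On the memory question, one way to get a rough number is the tracemalloc
module (in the standard library from Python 3.4, so newer than the Python
used for the timings in this thread). The sketch below is an assumption
about how one might measure it, not something from Peter's tests:
peak_memory_mib is a hypothetical helper, tracemalloc only sees
allocations made through Python's own allocator (memory used inside C
libraries may be missed), and tracing slows the code down, so it should
not be combined with the timing runs.

```python
import tracemalloc


def peak_memory_mib(consume):
    """Run consume() and return the peak traced allocation in MiB.

    Hypothetical helper; 'consume' is any zero-argument callable.
    """
    tracemalloc.start()
    try:
        consume()
        # get_traced_memory() returns (current, peak) in bytes.
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / (1024.0 * 1024.0)


if __name__ == "__main__":
    # Stand-in workload: build a list of a million integers.
    print("%.1f MiB" % peak_memory_mib(lambda: list(range(1000000))))
```

To compare the two parsers, the callable passed in would wrap the same
"for record in ..." loop used in the timing scripts, once with
NCBIXML.parse and once with SearchIO.parse.
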
> 
> ---------------------------------------------
> 
> The full details,
> 
> Before this change:
> 
> $ time python time_ncbixml.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 161.8s
> 
> real	2m41.894s
> user	2m41.208s
> sys	0m0.675s
> 
> $ time python time_ncbixml.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 161.8s
> 
> real	2m41.984s
> user	2m41.296s
> sys	0m0.677s
> 
> $ time python time_ncbixml.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 162.6s
> 
> real	2m42.771s
> user	2m41.995s
> sys	0m0.763s
> 
> 
> With this change:
> 
> $ time python time_ncbixml.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 152.4s
> 
> real	2m32.582s
> user	2m31.910s
> sys	0m0.663s
> 
> $ time python time_ncbixml.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 153.5s
> 
> real	2m33.680s
> user	2m32.977s
> sys	0m0.695s
> 
> $ time python time_ncbixml.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 153.8s
> 
> real	2m33.931s
> user	2m33.258s
> sys	0m0.661s
> 
> And if we go further and remove _debug_ignore_list and
> this bit of debug code the saving is marginal:
> 
> $ time python time_ncbixml.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 151.5s
> 
> real	2m31.611s
> user	2m30.934s
> sys	0m0.665s
> 
> $ time python time_ncbixml.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 151.2s
> 
> real	2m31.348s
> user	2m30.664s
> sys	0m0.674s
> 
> $ time python time_ncbixml.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 152.9s
> 
> real	2m32.994s
> user	2m32.314s
> sys	0m0.669s
> 
> This is the timing script I used,
> 
> $ more /tmp/time_ncbixml.py
> import sys
> import time
> from Bio.Blast import NCBIXML
> for f in sys.argv[1:]:
>     start = time.time()
>     count = 0
>     handle = open(f)
>     for record in NCBIXML.parse(handle):
>         count += 1
>     handle.close()
>     print("%i records in %s in %0.1fs" % (count, f, time.time() - start))
> #End of file
> 
> For comparison, here is the timing on the same setup but using
> SearchIO from Bow's current branch:
> 
> $ time python time_searchio.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 79.1s
> 
> real	1m19.259s
> user	1m18.397s
> sys	0m0.799s
> 
> $ time python time_searchio.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 78.7s
> 
> real	1m18.878s
> user	1m18.149s
> sys	0m0.719s
> 
> $ time python time_searchio.py thousand_blastx_nr.xml
> 1000 records in thousand_blastx_nr.xml in 79.5s
> 
> real	1m19.611s
> user	1m18.683s
> sys	0m0.918s
> 
> And the script:
> 
> $ more /tmp/time_searchio.py
> import sys
> import time
> from Bio import SearchIO
> for f in sys.argv[1:]:
>     start = time.time()
>     count = 0
>     handle = open(f)
>     for record in SearchIO.parse(handle, "blast-xml"):
>         count += 1
>     handle.close()
>     print("%i records in %s in %0.1fs" % (count, f, time.time() - start))
> #End of file
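
To make the comparison above easier to check, here is a small sketch that
averages the three runs reported for each case and works out the saving
and speed-up relative to the unmodified NCBIXML. The numbers are copied
from the timings quoted in this message; the summarise function and the
dictionary layout are just for illustration.

```python
from statistics import mean

# Wall-clock seconds printed by the timing scripts quoted in this thread.
timings = {
    "NCBIXML before change": [161.8, 161.8, 162.6],
    "NCBIXML after change": [152.4, 153.5, 153.8],
    "NCBIXML removing debug": [151.5, 151.2, 152.9],
    "Using SearchIO": [79.1, 78.7, 79.5],
}


def summarise(timings, baseline):
    """Return (label, mean seconds, % time saved, speed-up) per case."""
    base = mean(timings[baseline])
    rows = []
    for label, runs in timings.items():
        avg = mean(runs)
        saving = 100.0 * (base - avg) / base  # percent less time than baseline
        speedup = base / avg                  # how many times faster
        rows.append((label, avg, saving, speedup))
    return rows


if __name__ == "__main__":
    for label, avg, saving, speedup in summarise(timings, "NCBIXML before change"):
        print("%-24s %6.1fs  %5.1f%% saved  x%.2f" % (label, avg, saving, speedup))
```

This agrees with the summary figures: the commit saves roughly 5%, and
SearchIO comes out at about x2 on this machine.
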