[Biopython-dev] Uniprot XML parser on TrEmbl

Peter biopython at maubp.freeserve.co.uk
Fri Nov 12 10:29:51 UTC 2010


On Fri, Nov 12, 2010 at 10:24 AM, Andrea Pierleoni wrote:
> With the submitted patch the parser was able to correctly parse 12,347,303
> entries in the 62 GB XML file in 2h 13m.

That's good - but I thought the patch broke the unit test so I reverted it
last night. I'll double check this.

> It looks like reasonable performance to me, since you are going to spend
> more time downloading the 8 GB gzipped file and decompressing it.

On the other hand, you only download it once, and will probably only
decompress it once (although you can parse gzipped files from within
Python if you want to), but you will parse it many times.
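
For illustration, here is a rough sketch of reading the gzipped XML directly,
without unpacking it to disk first. It uses only the standard library rather
than the Biopython parser discussed in this thread; the function name
count_entries is made up for the example, and the namespace string is an
assumption about the UniProt XML schema that should be checked against the
actual file.

```python
import gzip
import xml.etree.ElementTree as ET

# Assumed UniProt XML namespace; verify against the root element of your file.
UNIPROT_NS = "{http://uniprot.org/uniprot}"


def count_entries(gz_path):
    """Count <entry> elements in a gzipped UniProt XML file.

    Hypothetical helper for illustration only. The file is decompressed
    on the fly as it is read, so no unpacked copy is written to disk.
    """
    count = 0
    with gzip.open(gz_path, "rb") as handle:
        # iterparse yields each element once its closing tag has been read.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == UNIPROT_NS + "entry":
                count += 1
                # Drop the entry's children and text to limit memory use.
                elem.clear()
    return count
```

The same gzip.open() handle idea should work with a parser that accepts a
file handle, at the cost of decompressing again on every run.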

My point is it probably could be made faster (if anyone wanted to spend
the time), but it is fast enough already to be useful, and worth having
in Biopython :)

Peter


