[Biopython-dev] Uniprot XML parser on TrEmbl

Andrea Pierleoni andrea at biocomp.unibo.it
Fri Nov 12 11:05:43 UTC 2010


> That's good - but I thought the patch broke the unit test so I reverted it
> last night. I'll double check this.
>

Yes, I've seen it on GitHub. Can you fix it?


> On the other hand, you only download it once, and will probably only
> decompress it once (although you can parse gzipped files from within
> python if you want to), but you will parse it many times.
>

Well, if you're looking at performance, you're not going to scan a 62 GB
file each time you search for an entry; you're going to index it. Then of
course it depends on what you are doing... but, given the monthly release,
maybe you're downloading and decompressing (or parsing a compressed file)
once a month.
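
To make the indexing idea concrete, here is a rough sketch using only the
Python standard library (modern Python 3, not the Biopython parser). It
assumes an uncompressed file with one tag per line; the tiny sample data
and the helper names build_index, fetch_entry and demo are made up for
illustration.

```python
import os
import re
import tempfile

# Hypothetical miniature file standing in for a real UniProt XML dump.
SAMPLE = b"""<?xml version="1.0"?>
<uniprot>
<entry dataset="TrEMBL">
  <accession>A0A000</accession>
  <name>A0A000_9ACTN</name>
</entry>
<entry dataset="TrEMBL">
  <accession>A0A001</accession>
  <name>A0A001_9ACTN</name>
</entry>
</uniprot>
"""

ENTRY_START = b"<entry"
ENTRY_END = b"</entry>"
ACCESSION = re.compile(rb"<accession>([^<]+)</accession>")


def build_index(path):
    """Map each entry's first accession to the (offset, length) of its block."""
    index = {}
    with open(path, "rb") as handle:
        offset = 0        # byte offset of the current line
        start = None      # byte offset of the open <entry>, if any
        accession = None  # first accession seen in the open entry
        for line in handle:
            if start is None and ENTRY_START in line:
                start = offset + line.index(ENTRY_START)
                accession = None
            if start is not None:
                if accession is None:
                    match = ACCESSION.search(line)
                    if match:
                        accession = match.group(1).decode("ascii")
                if ENTRY_END in line:
                    end = offset + line.index(ENTRY_END) + len(ENTRY_END)
                    if accession is not None:
                        index[accession] = (start, end - start)
                    start = None
            offset += len(line)
    return index


def fetch_entry(path, index, accession):
    """Read a single entry's raw XML without scanning the rest of the file."""
    offset, length = index[accession]
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)


def demo():
    """Write the sample to a temporary file, index it, and fetch one entry."""
    with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
        tmp.write(SAMPLE)
        path = tmp.name
    try:
        index = build_index(path)
        return index, fetch_entry(path, index, "A0A001")
    finally:
        os.remove(path)


if __name__ == "__main__":
    index, entry = demo()
    print(sorted(index))
    print(entry.decode("ascii"))
```

The full scan happens once per release; after that each lookup is a seek
plus a short read, and only the fetched block needs to go through the XML
parser.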

> My point is it probably could be made faster (if anyone wanted to spend
> the time), but it is fast enough already to be useful, and worth having
> in Biopython :)

Yes, I hope it can be made faster, but I have no idea how, since the
process is very straightforward. I have not profiled the parser, so I
cannot rule out a bottleneck.
The only obvious speed-up would be to use the multiprocessing library on
multi-CPU systems, but I've never seen it used in Biopython.
It should be really easy to implement, and maybe we can think about it
after Python 2.4 support is dropped. As far as I know, multiprocessing is
included in Python 2.6 and available for Python 2.5.
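
Just to illustrate the multiprocessing idea, a rough sketch (modern
Python 3, standard library only, not Biopython code). It assumes the file
has already been split into one XML string per entry; the sample entries
omit the real UniProt namespace, and summarize_entry and summarize_all are
hypothetical names.

```python
import xml.etree.ElementTree as ET
from multiprocessing import Pool

# Hypothetical, simplified entries (no XML namespace, made-up sequences).
SAMPLE_ENTRIES = [
    "<entry><accession>A0A000</accession><sequence>MKTAYIAK</sequence></entry>",
    "<entry><accession>A0A001</accession><sequence>MSD</sequence></entry>",
    "<entry><accession>A0A002</accession><sequence>MAAGLT</sequence></entry>",
]


def summarize_entry(entry_xml):
    """Parse one <entry> block and return (accession, sequence length)."""
    entry = ET.fromstring(entry_xml)
    accession = entry.findtext("accession")
    sequence = entry.findtext("sequence") or ""
    # Drop any whitespace inside the sequence text before counting residues.
    return accession, len("".join(sequence.split()))


def summarize_all(entries, processes=1, chunksize=100):
    """Parse entries serially, or spread them over a pool of worker processes."""
    if processes == 1:
        return [summarize_entry(entry) for entry in entries]
    with Pool(processes=processes) as pool:
        # pool.map returns results in the same order as the input entries.
        return pool.map(summarize_entry, entries, chunksize=chunksize)


if __name__ == "__main__":
    print(summarize_all(SAMPLE_ENTRIES, processes=2, chunksize=1))
```

Whether this pays off depends on how much the per-entry parsing costs
compared with shipping the strings to the workers, so it would need
measuring on real data first.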

On the other hand, Biopython has the fastest UniProt XML parser among the
Bio* projects and (to my knowledge) the fastest public parser on the
planet ;) I bet the UniProt guys have their own parser...

Andrea



