[Biopython-dev] Uniprot XML parser on TrEmbl

Peter biopython at maubp.freeserve.co.uk
Fri Nov 12 12:00:42 UTC 2010


On Fri, Nov 12, 2010 at 11:05 AM, Andrea Pierleoni
<andrea at biocomp.unibo.it> wrote:
>
>> That's good - but I thought the patch broke the unit test so I reverted it
>> last night. I'll double check this.
>>
>
> Yes, I've seen it on GitHub. Can you fix it?
>

Probably. I'll make time to look at it before the Biopython 1.56 release
(which is unlikely to happen this week, delayed by the identification of
some problems running under Jython on Windows).

>> On the other hand, you only download it once, and will probably only
>> decompress it once (although you can parse gzipped files from within
>> python if you want to), but you will parse it many times.
>>
>
> Well, if you're looking at performance, you're not scanning a 62 GB file
> each time you search for an entry; you're going to index it. Then of
> course it depends on what you are doing... but, given the monthly
> release, maybe you're downloading and decompressing (or parsing
> a compressed file) once a month.

Yeah, it depends.
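
To make the two points above concrete (reading the compressed file
directly, and indexing it once instead of rescanning it), here is a
minimal sketch using only the standard library. It is not the Biopython
parser: the function name index_entries is hypothetical, it only pulls
out the first accession and the entry name, and it assumes the file
uses the usual UniProt XML namespace.

```python
import gzip
import xml.etree.ElementTree as ET

# Assumption: entries live in the standard UniProt XML namespace.
NS = "{http://uniprot.org/uniprot}"


def index_entries(path):
    """Map each entry's first accession to its entry name.

    The gzipped file is decompressed on the fly and parsed
    incrementally, so it is never fully unpacked on disk.
    """
    index = {}
    with gzip.open(path, "rb") as handle:
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == NS + "entry":
                accession = elem.findtext(NS + "accession")
                name = elem.findtext(NS + "name")
                index[accession] = name
                # Drop the entry's children once we have what we need.
                elem.clear()
    return index
```

After one pass, a lookup is a dictionary access rather than another
scan of the file. With Biopython, the same gzip handle could instead be
passed to Bio.SeqIO.parse(handle, "uniprot-xml"), assuming that format
name, to get full SeqRecord objects.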

>> My point is it probably could be made faster (if anyone wanted to spend
>> the time), but it is fast enough already to be useful, and worth having
>> in Biopython :)
>
> Yes, I hope it can be made faster, but I have no idea how, since
> the process is very straightforward. I have not profiled the
> parser, so I cannot rule out a bottleneck.

That would be worthwhile at some point.

> The only obvious speed-up would be using the multiprocessing library on
> multi-CPU systems, but I've never seen it used in Biopython.

We haven't been able to due to the Python 2.4 requirement, but
I know of people using Biopython and multiprocessing together.

> It should be really easy to implement, and maybe we can think about
> it after Python 2.4 support is dropped. As far as I know, multiprocessing
> is included in Python 2.6 and available in Python 2.5.

Personally I'd try profiling the current single-threaded code before
going to multiprocessing.
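
For anyone wanting to try this, a minimal sketch of profiling a parsing
function with the standard library's cProfile and pstats modules. The
names profile_call and count_entries are hypothetical, and
count_entries is only a stand-in workload; in practice you would pass
in whatever function drives the real parser.

```python
import cProfile
import io
import pstats
import xml.etree.ElementTree as ET

# Assumption: entries live in the standard UniProt XML namespace.
NS = "{http://uniprot.org/uniprot}"


def count_entries(xml_bytes):
    """Stand-in workload: count <entry> elements in an XML document."""
    root = ET.fromstring(xml_bytes)
    return sum(1 for _ in root.iter(NS + "entry"))


def profile_call(func, *args, limit=10):
    """Run func(*args) under cProfile.

    Returns the function's result and a text report listing up to
    `limit` functions sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(limit)
    return result, stream.getvalue()
```

Printing the returned report shows where the time goes, which should
say whether there is a single bottleneck worth fixing before spreading
the work over several processes.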

> On the other hand, Biopython has the fastest Uniprot XML parser
> among Bio* projects and (to my knowledge) the fastest public
> parser on the planet ;) I bet the Uniprot guys have their own parser...

Which of the other Bio* projects have a Uniprot XML parser?
(Or was that intended as a joke?)

Peter



