[Biopython-dev] New: Uniprot XML parser

Peter biopython at maubp.freeserve.co.uk
Mon Jul 26 15:04:41 UTC 2010


Andrea Pierleoni wrote:
>
> Hi Everyone,
> I've been using Biopython a lot over the last couple of years, and it is
> very useful to me. So now it's my turn to contribute and be helpful to
> someone else.
> I wrote a parser for the Uniprot XML format that is reasonably fast (8000
> entries/min on a core2duo mainstream PC). The main improvements over the
> current SwissProt flat file parser are deeper parsing of comment fields
> and a SeqRecord containing features.
>
> The parser is based on the ElementTree library and was successfully tested
> on the complete SwissProt database (v57.12). Thus I think it is ready to
> be released.
>
> I followed the rules to develop a new parser for SeqIO, filed an
> enhancement bug to bugzilla (bug 2992), and included the parser in a
> public biopython fork on github available at:
>
> http://github.com/apierleoni/biopython/tree/uniprotxml-branch
>
> The new parser is in the "uniprotxml-branch" branch, and the parser code
> is in Bio/SeqIO/UniprotIO.py
>
> The parser can be used from SeqIO using:
>
> iterator=SeqIO.parse(handle,'uniprot')
>
> I think this could be easily integrated into Biopython. A unit test is
> still missing, but should be very easy to write.
> In any case, code review and suggestions are welcome.
>
> Andrea

Hi Andrea,

As you have probably noticed via github, I have been trying out your code.

I noticed you hadn't implemented indexing support so I have done this on
my branch as a quick hack:

http://github.com/peterjc/biopython/commits/uniprot

What I want to be able to do is seek to the start of an <entry ...> in the
XML handle, and have the parser continue from that point. I've done this
by the nasty trick of extracting the record from the XML file as a string
(using the get_raw method of the index class), then adding the XML
header and footer to it, and then invoking your parser. There should
be a better way to do this, but I am not familiar enough with
ElementTree to see it right away. Can you improve on this?
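
The wrap-and-reparse trick described above can be sketched roughly as
follows. This is only an illustration using the standard library's
ElementTree, not the code on either branch: the simplified header, the toy
two-entry file with made-up accessions, and the helper names
find_entry_offsets and parse_raw_entry are all assumptions made for the
example.

```python
import xml.etree.ElementTree as ET

NS = "http://uniprot.org/uniprot"
# Simplified header/footer (assumption): a real UniProt file declares more
# attributes on the root element.
HEADER = ('<?xml version="1.0" encoding="UTF-8"?>\n'
          '<uniprot xmlns="%s">\n' % NS).encode("utf-8")
FOOTER = b"</uniprot>\n"

# Toy data standing in for a UniProt XML download (accessions are made up).
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<uniprot xmlns="http://uniprot.org/uniprot">\n'
    b'<entry dataset="Swiss-Prot"><accession>P00001</accession></entry>\n'
    b'<entry dataset="Swiss-Prot"><accession>P00002</accession></entry>\n'
    b'</uniprot>\n'
)


def find_entry_offsets(data):
    """Return (start, end) byte offsets for each <entry ...>...</entry>."""
    offsets = []
    pos = 0
    while True:
        start = data.find(b"<entry ", pos)
        if start == -1:
            break
        end = data.index(b"</entry>", start) + len(b"</entry>")
        offsets.append((start, end))
        pos = end
    return offsets


def parse_raw_entry(raw):
    """Wrap one raw entry in the header and footer, then parse it."""
    root = ET.fromstring(HEADER + raw + FOOTER)
    return root.find("{%s}entry" % NS)


# Jump straight to the second record without parsing the first one.
start, end = find_entry_offsets(SAMPLE)[1]
entry = parse_raw_entry(SAMPLE[start:end])
print(entry.find("{%s}accession" % NS).text)  # prints P00002
```

The cost of this approach is that every lookup builds a new small document
and copies the record's bytes, which is why it feels like a hack even
though it works.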

I'd also like to have SeqFeature parsing done for the plain text "swiss"
parser as well, which can double as a cross check for your parser. Did you
look at my old patch? http://bugzilla.open-bio.org/show_bug.cgi?id=2235

We should also run a comparison test of the "swiss" plain text and
"uniprot" XML parsers on the full downloads of UniProtKB/Swiss-Prot
and/or UniProtKB/TrEMBL, see http://www.uniprot.org/downloads
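
As a rough idea of what such a comparison could look like, here is a
parser-agnostic sketch. It only assumes each parser yields objects with
.id and .seq attributes, as SeqRecord does; the Rec namedtuple and the
compare_parsers name are hypothetical stand-ins so the example runs without
Biopython. In real use the two iterators would come from SeqIO.parse on
the plain text file ("swiss") and on the XML file ("uniprot", as named in
Andrea's message).

```python
from collections import namedtuple
from itertools import zip_longest

# Hypothetical stand-in for SeqRecord, so the sketch is self-contained.
Rec = namedtuple("Rec", ["id", "seq"])


def compare_parsers(records_a, records_b):
    """Yield (index, reason) for each mismatch between two record streams."""
    for i, (a, b) in enumerate(zip_longest(records_a, records_b)):
        if a is None or b is None:
            # One stream ran out first; nothing left to pair up.
            yield i, "record count differs"
            return
        if a.id != b.id:
            yield i, "id %r != %r" % (a.id, b.id)
        elif str(a.seq) != str(b.seq):
            yield i, "sequence differs for %s" % a.id


# Toy records only; real input would be the two parsers' iterators.
swiss = [Rec("P1", "MKV"), Rec("P2", "AAA")]
xml = [Rec("P1", "MKV"), Rec("P2", "AAT")]
for index, reason in compare_parsers(swiss, xml):
    print(index, reason)
```

Checking ids and sequences first would catch gross disagreements cheaply;
features and annotations could be compared in a second pass once those
agree.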

Peter



