[Bioperl-l] SwissProt DE lines and UniProt XML / TagTree as XML in BioSQL

Peter biopython at maubp.freeserve.co.uk
Thu Jan 21 12:33:53 UTC 2010


Hi all,

This is cross-posted to try to ensure the relevant people see it.
I suggest we continue the discussion on the BioSQL list
(for how to serialise structured annotation to BioSQL), and/or
the OpenBio list (for things like file format naming conventions).

I am hoping we (Bio*) can be consistent in how we parse and load
into BioSQL the SwissProt DE lines (known as "swiss" format in
both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
equivalent UniProt XML tags (which we are tentatively going to
call the "uniprot" format in Biopython's SeqIO - comments?).

Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
files and load them into BioSQL. Biopython currently treats the DE
(description) lines as one long string, as BioPerl used to:

http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html

I understand that BioPerl now turns the SwissProt DE lines into a
TagTree, and for storing this in BioSQL this gets serialised as XML.
I would like Biopython to handle this the same way (although rather
than a Perl TagTree, we'd use a Python structure of course), and
would appreciate clarification of what exactly was implemented
(e.g. which bit of the BioPerl source code should we look at,
and could you show a worked example?).
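
To make the question concrete, here is a minimal Python sketch of the
kind of thing I have in mind: group the new-style DE lines by category
(RecName, AltName, SubName, Flags) and serialise the result as XML.
The function names and the XML element names (e.g. "description",
"Flag") are my own assumptions for illustration and are NOT what
BioPerl emits; the sample lines are hand-written, and Contains: and
Includes: sections are ignored.

```python
import xml.etree.ElementTree as ET

# Categories that start a new group in new-style DE lines.
CATEGORIES = ("RecName", "AltName", "SubName", "Flags")

# Hand-written sample in the new-style DE layout (illustrative only).
SAMPLE = [
    "DE   RecName: Full=Annexin A5;",
    "DE            Short=Annexin-5;",
    "DE   AltName: Full=Annexin V;",
    "DE   Flags: Precursor;",
]


def parse_de_lines(lines):
    """Return a list of (category, [(key, value), ...]) tuples."""
    entries = []
    for line in lines:
        if not line.startswith("DE"):
            continue
        text = line[2:].strip().rstrip(";")
        if not text:
            continue
        head, sep, rest = text.partition(":")
        if sep and head in CATEGORIES:
            # A new category starts; the rest of the line is its first field.
            entries.append((head, []))
            text = rest.strip()
        elif not entries:
            raise ValueError("continuation line before any category: %r" % line)
        category, fields = entries[-1]
        if category == "Flags":
            # Flags are bare words, possibly several per line.
            for flag in text.split(";"):
                fields.append(("Flag", flag.strip()))
        else:
            key, _, value = text.partition("=")
            fields.append((key.strip(), value.strip()))
    return entries


def de_to_xml(entries):
    """Serialise the parsed structure using a hypothetical XML layout."""
    root = ET.Element("description")
    for category, fields in entries:
        node = ET.SubElement(root, category)
        for key, value in fields:
            ET.SubElement(node, key).text = value
    return ET.tostring(root, encoding="unicode")


entries = parse_de_lines(SAMPLE)
xml_text = de_to_xml(entries)
```

If BioPerl's serialisation uses different tag names or nesting, only
de_to_xml would need to change; the intermediate list of tuples is the
part I would like us to agree on.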

Andrea Pierleoni (CC'd - not sure if he is on the BioSQL or
Open-Bio lists yet) has started work on parsing UniProt XML
files for Biopython. There the information from the DE lines is
already broken up with XML markup. Hopefully its nested
structure matches what BioPerl is doing with the SwissProt
DE lines.
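
For comparison, a rough sketch of pulling the same nested structure out
of the protein element of a UniProt XML entry, using only the standard
library. The fragment is hand-written and heavily trimmed (a real entry
has many more elements), and protein_names is a hypothetical helper,
not part of Biopython.

```python
import xml.etree.ElementTree as ET

# UniProt XML elements live in this namespace.
NS = "{http://uniprot.org/uniprot}"

# Hand-written, trimmed fragment for illustration only.
FRAGMENT = """<entry xmlns="http://uniprot.org/uniprot">
  <protein>
    <recommendedName>
      <fullName>Annexin A5</fullName>
      <shortName>Annexin-5</shortName>
    </recommendedName>
    <alternativeName>
      <fullName>Annexin V</fullName>
    </alternativeName>
  </protein>
</entry>"""


def protein_names(entry):
    """Return a list of (group, [(field, text), ...]) from <protein>."""
    names = []
    protein = entry.find(NS + "protein")
    if protein is None:
        return names
    for group in protein:
        # Strip the namespace prefix so the tags are easier to compare.
        fields = [(child.tag.replace(NS, ""), child.text) for child in group]
        names.append((group.tag.replace(NS, ""), fields))
    return names


names = protein_names(ET.fromstring(FRAGMENT))
```

The result has the same shape as a category-by-category parse of the
plain text DE lines would give, with recommendedName/fullName in place
of RecName/Full, so a simple mapping between the two vocabularies might
be enough.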

Regards,

Peter


