[Biopython] Some help to access "hidden" features :-)

Peter Cock p.j.a.cock at googlemail.com
Thu Mar 7 22:19:15 UTC 2013


On Thu, Mar 7, 2013 at 9:26 PM, Téletchéa Stéphane
<stephane.teletchea at inserm.fr> wrote:
> Dear biopythoners,
>
> I am struggling to extract some information from a UniProt file.
>
> a) Get the initial file, for instance
> http://www.uniprot.org/uniprot/P02724.xml
> b) parse it:
>
> python
> >>> from Bio import SeqIO
> >>> record=list(SeqIO.parse("P02724.xml",'uniprot-xml'))
> >>> print record[0].dbxrefs
> ...
>
> >>> for i in record[0].dbxrefs:
> ...     if 'PDB:' in i:
> ...             print i
> ...
> PDB:1AFO
> PDB:1MSR
> PDB:2KPE
> PDB:2KPF

Excellent - a self contained example :)

That makes it much easier for us to see what you're
doing and how to help. Thank you.

> In the Uniprot file, there are annotations for the 1AFO model:
> NMR method, starts at 81 and ends at 120.
>
> The corresponding entry in the xml file is:
>
> <dbReference type="PDB" id="1AFO">
> <property type="method" value="NMR"/>
> <property type="chains" value="A/B=81-120"/>
> </dbReference>
>
> According to the module source code
> (http://biopython.org/DIST/docs/api/Bio.SeqIO.UniprotIO-pysrc.html),
> it should be possible to access this data, as it is correctly handled:
>
>         def _parse_dbReference(element):
>             self.ParsedSeqRecord.dbxrefs.append(element.attrib['type'] + ':' + element.attrib['id'])
>             ...

As you will have seen, the SeqRecord's dbxrefs does get
populated with the key information - but this is (based on
usage in other file formats) a very simple list of strings.
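
Since each entry is just a "database:identifier" string, pulling out
the identifiers for one database is plain string handling. A minimal
sketch, with the four PDB strings from your session hard-coded so it
runs without Biopython or the XML file; the helper name
ids_for_database is made up for this illustration and is not part of
Biopython:

```python
# Strings copied from the session quoted above; with Biopython you would
# use record[0].dbxrefs instead of this hard-coded list.
dbxrefs = ["PDB:1AFO", "PDB:1MSR", "PDB:2KPE", "PDB:2KPF"]


def ids_for_database(dbxrefs, database):
    """Return the identifiers of all entries of the form 'database:id'.

    Hypothetical helper for illustration only, not a Biopython function.
    """
    prefix = database + ":"
    # startswith() is stricter than "'PDB:' in entry": it only matches
    # entries whose database name is exactly the one requested.
    return [entry[len(prefix):] for entry in dbxrefs if entry.startswith(prefix)]


pdb_ids = ids_for_database(dbxrefs, "PDB")
print(pdb_ids)
```

This only recovers the identifiers; the method and chain details are
not in these strings at all.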

Right now the extra information *is not returned*, mainly as
it doesn't naturally map onto the existing SeqRecord model.

A little later in that method you'd have seen a comment:
"TODO - How best to store these, do SeqFeatures make sense?"
and the following lines create a SeqFeature object, but never
add it to the returned SeqRecord. Elsewhere the UniProt
file does have things we store as SeqFeature objects - so
doing this for the database reference information is a bit
odd. Perhaps we'd be better off following the approach
used for references in GenBank files instead? I'm unclear
what is best (partly since I don't use these bits of data).

What do you think the parser should do with this data?

[Note that in this situation you might be better off using one
of the Python standard library modules to work with the XML
directly (e.g. ElementTree or cElementTree) if you need
all the details in the UniProt XML file which are not yet
handled in the conversion to a SeqRecord object.]
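
As a rough sketch of that ElementTree route: the snippet below reads
the dbReference elements of type "PDB" and collects their property
values. To keep it self-contained it parses a trimmed-down stand-in
for P02724.xml built from the fragment you quoted, not the real file;
the function name pdb_cross_references is my own invention. It assumes
the elements live in the http://uniprot.org/uniprot XML namespace, as
they do in the UniProt XML files I have seen.

```python
import xml.etree.ElementTree as ET

# Namespace prefix mapping for the XPath-style queries below
# (assumption: the file uses the usual UniProt XML namespace).
NS = {"up": "http://uniprot.org/uniprot"}

# Trimmed-down stand-in for P02724.xml, built from the quoted fragment.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>P02724</accession>
    <dbReference type="PDB" id="1AFO">
      <property type="method" value="NMR"/>
      <property type="chains" value="A/B=81-120"/>
    </dbReference>
  </entry>
</uniprot>"""


def pdb_cross_references(root):
    """Map each PDB identifier to a dict of its property type/value pairs."""
    result = {}
    for ref in root.iterfind(".//up:dbReference[@type='PDB']", NS):
        # Each <property type="..." value="..."/> child becomes one dict entry.
        props = {p.get("type"): p.get("value")
                 for p in ref.findall("up:property", NS)}
        result[ref.get("id")] = props
    return result


root = ET.fromstring(SAMPLE_XML)
pdb_refs = pdb_cross_references(root)
for pdb_id, props in sorted(pdb_refs.items()):
    print(pdb_id, props.get("method"), props.get("chains"))
```

For the downloaded file you would replace the ET.fromstring(...) line
with root = ET.parse("P02724.xml").getroot() and leave the rest as is.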

Regards,

Peter



