[Biopython] Formatting of the xml return from Entrez DB Gene is not parsed correctly

Fri Sep 11 09:37:15 EDT 2009

Hi Michael,

I've CC'd this to the list.

On Fri, Sep 11, 2009 at 1:51 PM, Michael S. Koeris
<michael.koeris at gmail.com> wrote:
>
> Yes indeed that does help - go dyslexia....

Easily done. Actually, on looking a little closer, the NCBI returned
"XML presented with HTML" (full of &lt; and &gt; entities), which is still
quite unsuitable for parsing, but not actually an error page as I had assumed.

> what seems to happen though is that it's not a dictionary but a list
> made up of multiple dictionaries is that right?

Probably - the Bio.Entrez parser will turn the XML nested structure into
lists and dictionaries as appropriate.
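
If it helps to see what you actually got back, here is a rough sketch of a
helper that summarises the nesting. The name describe() is just something I
made up for illustration (it is not part of Biopython), and it assumes the
parsed objects behave like ordinary Python lists, dicts and strings, which
as far as I know they do. The sample data is hand-written, not real NCBI
output.

```python
def describe(node, depth=0, max_depth=3):
    """Return lines summarising how lists and dicts are nested in node.

    Hypothetical helper for illustration only; not a Biopython function.
    For lists, only the first item is described to keep the output short.
    """
    indent = "  " * depth
    lines = []
    if isinstance(node, dict):
        lines.append("%sdict with keys %s" % (indent, sorted(node)))
        if depth < max_depth:
            for key in sorted(node):
                lines.extend(describe(node[key], depth + 1, max_depth))
    elif isinstance(node, list):
        lines.append("%slist of %d item(s)" % (indent, len(node)))
        if node and depth < max_depth:
            lines.extend(describe(node[0], depth + 1, max_depth))
    else:
        lines.append("%s%s: %r" % (indent, type(node).__name__, node))
    return lines


# Hand-written stand-in for a parsed result (an assumption, not NCBI output).
sample_record = [{"DbFrom": "gene", "IdList": ["90"]}]
for line in describe(sample_record):
    print(line)
```

Running it on the real result of Entrez.read(...) should tell you quickly
whether the top level is a list or a dictionary, and which keys to index.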

Going back to your original email, you just wanted "to parse out the
nucleic acid accession numbers from an Entrez.efetch query made
to the Gene database", so I would suggest using elink instead of
efetch. See for example:

http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/elink_help.html
http://lists.open-bio.org/pipermail/biopython/2009-August/005472.html

In your case something like this:

>>> from Bio import Entrez
>>> data = Entrez.read(Entrez.elink(db="nuccore", dbfrom="gene", id="90", retmode="xml"))
>>> for db in data:
...     print "Links for", db["IdList"], "from database", db["DbFrom"]
...     for link in db["LinkSetDb"][0]["Link"]:
...         print link["Id"]
...
Links for ['90'] from database gene
224589811
224514625
194387497
190194409
187169269
187169268
164694819
157724517
157696421
89161198
88958353
74230050
50504351
22450871
21707501
18097079
15668129
2295237
402184
338218
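
One caveat with the loop above: db["LinkSetDb"][0] assumes at least one set
of links came back, and I would expect an IndexError for a gene with no
links. Below is a slightly more defensive sketch. The function name
linked_ids() and its link_name argument are my own invention for
illustration, and the sample data is hand-written to mimic the structure
shown above (the "gene_nuccore" link name and the "DbTo" entry are
assumptions about what the parsed record contains), so it runs without
contacting the NCBI.

```python
def linked_ids(linksets, link_name=None):
    """Collect linked IDs from a parsed elink result.

    Hypothetical helper for illustration only; not a Biopython function.
    Returns a list of (source database, source IDs, linked IDs) tuples,
    one per link set. If link_name is given, only LinkSetDb entries with
    that LinkName are used.
    """
    results = []
    for linkset in linksets:
        targets = []
        # Use .get() so a record without any links gives an empty list
        # rather than raising an exception.
        for linksetdb in linkset.get("LinkSetDb", []):
            if link_name is not None and linksetdb.get("LinkName") != link_name:
                continue
            targets.extend(link["Id"] for link in linksetdb["Link"])
        results.append((linkset["DbFrom"], list(linkset["IdList"]), targets))
    return results


# Hand-written stand-in for Entrez.read(Entrez.elink(...)); only the first
# two IDs from the session above are included.
sample = [{
    "DbFrom": "gene",
    "IdList": ["90"],
    "LinkSetDb": [{
        "DbTo": "nuccore",
        "LinkName": "gene_nuccore",  # assumed link name, for illustration
        "Link": [{"Id": "224589811"}, {"Id": "224514625"}],
    }],
}]

for db_from, source_ids, targets in linked_ids(sample):
    print("Links for", source_ids, "from database", db_from)
    for target in targets:
        print(target)
```

With real data you would pass the result of Entrez.read(...) in place of
sample.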

Peter