[Biopython] Parsing xml from Bioproject without DTD - how to use schema?

Anna Simpson acsimpson at gmail.com
Sun May 17 20:24:41 UTC 2015


Hi all,
I've been trying to parse XML files from an efetch query to the BioProject
database, and I keep getting an error message about a missing DTD when using
Entrez.read or Entrez.parse (and validate=False gets me no data at all).
I found a post on this mailing list from 2013, where a gentleman had the
same problem - he emailed NCBI and was told the following:

"Yes this is the "normal" but it is an oversight as a dtd was never created
for this database. I will have to open a ticket to the developers to create
this and have it included in the XML and on the DTD web page."

I've emailed NCBI about this again, but I'm guessing there still isn't one
(I can't find it on the DTD index page). My various googlings have led me
to find that there is a schema for BioProject, and that perhaps it could
somehow be used to parse these XML files. How might I go about doing that?

I've been trying to use XML parsers like ElementTree and Beautiful Soup
but keep running into walls (how to feed an Entrez handle into a parser,
and how to pull out deeply nested information when the nesting is different
for each XML document I get and I'm running this through a loop),
so it would be great if I could ...stop doing that.
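
For concreteness, here is a minimal sketch of the ElementTree route, in case
it helps anyone answer. It assumes the handle is an ordinary file-like
object, so an in-memory BytesIO stands in for it here. The helper name
collect_text and the element names in the sample are made up for
illustration and are not taken from a real BioProject record. Searching by
tag name with iter() avoids hard-coding the nesting path, which is the part
that differs between documents.

```python
import io
import xml.etree.ElementTree as ET


def collect_text(handle, tag):
    """Return the text of every element named `tag`, at any nesting depth."""
    # ET.parse accepts a filename or any file-like object with a read() method.
    tree = ET.parse(handle)
    # iter(tag) walks the whole tree in document order, so the depth at which
    # the element sits does not matter.
    return [element.text for element in tree.iter(tag)]


# Stand-in for a handle from an efetch query; the structure is illustrative
# only (hypothetical element names, two records with different nesting).
sample = io.BytesIO(
    b"<RecordSet>"
    b"<DocumentSummary><Project><ProjectDescr>"
    b"<Title>Example project</Title>"
    b"</ProjectDescr></Project></DocumentSummary>"
    b"<DocumentSummary><Title>Shallow title</Title></DocumentSummary>"
    b"</RecordSet>"
)

titles = collect_text(sample, "Title")
print(titles)
```

If the handle returned by Entrez.efetch behaves like a normal file object,
it should be possible to pass it to collect_text in place of the BytesIO,
but I have not confirmed that for every retmode.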

Thanks,
Anna
University of Washington, Seattle