[Biopython] Parsing xml from Bioproject without DTD - how to use schema?

Peter Cock p.j.a.cock at googlemail.com
Mon May 18 10:41:56 UTC 2015


On Sun, May 17, 2015 at 9:24 PM, Anna Simpson <acsimpson at gmail.com> wrote:
> Hi all,
> I've been trying to parse xml files from an efetch query to the bioproject
> database, and kept getting an error message about no dtd (and
> validation=False gets me no data at all) when using Entrez.read or
> Entrez.parse. I found a post on this mailing list from 2013, where a
> gentleman had the same problem - he emailed NCBI and was told the following:
>
> "Yes this is the "normal" but it is an oversight as a dtd was never created
> for this database. I will have to open a ticket to the developers to create
> this and have it included in the XML and on the DTD web page."
>
> I've emailed NCBI about this again but I'm guessing there still isn't one
> (and I can't find it in the DTD index page). But my various googlings have
> led me to find that there is a schema for bioproject, and that perhaps,
> somehow, it could be used to parse these xml files. How might I go about
> doing that?
>
> I've been trying to use xml parsers like ElementTree and Beautiful Soup but
> keep running into walls (how to stick an entrez handle into a parser, how to
> get it to give me deeply nested information when the nesting is different
> for each xml document I get and I'm running this through a loop) so it would
> be great if I could ...stop doing that.
>
> Thanks,
> Anna
> University of Washington, Seattle

Hi Anna,

It sounds like someone at the NCBI is aware of a problem on their end.

Can you post a short self-contained snippet that imports the Entrez module,
calls efetch for bioproject, and tries to parse the result? I'm curious to try
this with the latest Biopython (what will be v1.66), which now handles NCBI
XSD schema files as well as DTD files (though that may not be relevant here).
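
On the question of feeding an Entrez handle to a generic XML parser: the
handle that Entrez.efetch returns is file-like, so the standard library's
ElementTree should be able to read it directly, and Element.iter() finds a
tag at any nesting depth. Below is a minimal sketch under those assumptions.
The helper name find_texts, the sample XML, and its element names are all
made up for illustration and are not taken from the real BioProject output;
an in-memory handle stands in for the efetch call so the example runs
without network access.

```python
import io
import xml.etree.ElementTree as ET


def find_texts(handle, tag):
    """Return the text of every element named `tag`, at any nesting depth."""
    tree = ET.parse(handle)  # ET.parse accepts any file-like object
    return [elem.text for elem in tree.getroot().iter(tag)]


# Stand-in for the handle an efetch call would return.
# The element names below are hypothetical, not the real BioProject layout.
sample = io.BytesIO(
    b"<RecordSet><DocumentSummary><Project><ProjectDescr>"
    b"<Title>Example project</Title>"
    b"</ProjectDescr></Project></DocumentSummary></RecordSet>"
)

titles = find_texts(sample, "Title")
print(titles)
```

Because iter() walks the whole tree, the same call should work inside a loop
even when each document nests the tag of interest differently.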

Regards,

Peter

