[Biopython-dev] Parsing efetch results from the Journals database through Bio.Entrez

Fri Sep 3 17:26:51 UTC 2010

Hi everybody,

The parser in Bio.Entrez can parse any XML returned by the Entrez E-utilities as long as the corresponding DTD is available (the DTDs are included with each release of Biopython). One corner case is efetch results from the Journals database. Officially, efetch from the Journals database does not generate XML output, only plain text or HTML. However, when XML is requested explicitly, Entrez in practice returns XML-like output. Our parser in Bio.Entrez is able to parse this output, but only thanks to several hacks in the parser code.

As probably few users are interested in efetch output from the Journals database, I suggest that we remove these hacks from Bio.Entrez altogether -- after all, this is for XML that is not supported by NCBI to begin with. If some users really do want to parse efetch output from the Journals database, we can always add a simple parser for the plain-text efetch output.

The advantage of removing these hacks is that it will allow us to validate all XML against the DTD and, if the user so requests, to raise an error when the XML contains elements that the DTD does not declare.
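
To make the idea concrete, here is a rough sketch of optional strict checking, using only the expat bindings from the standard library. It is not the Bio.Entrez code. The names parse_elements, ValidationError, and validate are made up for illustration, and the `declared` set is only a stand-in for the element names a real DTD would declare.

```python
import xml.parsers.expat


class ValidationError(ValueError):
    """Raised when an element is not declared (hypothetical name)."""


def parse_elements(xml_text, declared, validate=True):
    """Return the element names seen, in document order.

    `declared` is a set of element names standing in for a parsed DTD;
    this is an assumption of the sketch, not how Bio.Entrez stores DTDs.
    """
    seen = []

    def start_element(name, attrs):
        # Only complain about undeclared elements if the caller asked for it.
        if validate and name not in declared:
            raise ValidationError("element %r is not declared in the DTD" % name)
        seen.append(name)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    # Exceptions raised inside a handler propagate out of Parse().
    parser.Parse(xml_text, True)
    return seen
```

With validate=True an undeclared element stops the parse with an error; with validate=False the same input is accepted silently, which is roughly the choice we could offer users once the Journals hacks are gone.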

Any objections?

--Michiel.