[Biopython] Entrez.parse error

Michiel de Hoon mjldehoon at yahoo.com
Thu Dec 22 05:59:46 UTC 2016


> Entrez.parse was written for a reason, to parse complex xml data so that it is easy to extract citation data from it.
> Entrez.read does indeed work, but the output contains such a complex data structure, it is a non-trivial exercise to parse it.
There is only one difference between Entrez.parse and Entrez.read: Entrez.read parses the whole data at once, while Entrez.parse iterates through the data.
There is no difference in the complexity of the data structure returned by Entrez.parse and Entrez.read: in both cases, the data structure is consistent with what NCBI specifies in the DTD referenced in the XML.
Now, Entrez.parse only makes sense if the data structure returned by NCBI corresponds to a list in Python. If it doesn't, then iterating through the XML data makes no sense, and you should use Entrez.read instead.
In this particular case, NCBI has changed the data structure such that Entrez.parse is not appropriate and you should use Entrez.read instead. This does not mean that Entrez.parse is broken.
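
As a rough sketch of what that could look like for the example quoted below: read the whole result at once, then loop over the articles inside it. This assumes the object returned by Entrez.read is dict-like with the articles stored under a 'PubmedArticle' key, which follows from the new DTD but should be checked against your Biopython version. The helper names article_titles and fetch_titles are made up for illustration and are not part of Biopython.

```python
def article_titles(parsed):
    """Return the titles of all PubmedArticle entries in a parsed result.

    Assumption: `parsed` is the dict-like object returned by Entrez.read
    for a PubMed efetch, with the articles under the 'PubmedArticle' key.
    """
    return [
        str(article["MedlineCitation"]["Article"]["ArticleTitle"])
        for article in parsed.get("PubmedArticle", [])
    ]


def fetch_titles(id_list, email):
    """Fetch PubMed records and return their titles (needs network access)."""
    # Imported here so that article_titles can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.efetch(db="pubmed", id=id_list, retmode="xml")
    try:
        # Entrez.read parses the whole XML document at once.
        parsed = Entrez.read(handle)
    finally:
        handle.close()
    return article_titles(parsed)
```

For the two records in the quoted example, the call would be fetch_titles("19304878,14630660", "Your.Name.Here@example.org").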
We do need to update the Biopython documentation though.
Best,
-Michiel.


    On Thursday, December 22, 2016 12:47 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
 

 On Wed, Dec 21, 2016 at 7:47 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> In what sense is the current result from Entrez.read more difficult to parse
> than the previous result from Entrez.parse?
> As far as I can tell, Entrez.read and Entrez.parse are both working
> correctly.
> Best,
> -Michiel

In this example we expected a list-like structure with an
entry for each record requested (here two), allowing
iteration over these records with Entrez.parse as in the
original example:

from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.efetch("pubmed", id="19304878,14630660", retmode="xml")
records = Entrez.parse(handle)
for record in records:
    print(record['MedlineCitation']['Article']['ArticleTitle'])

That no longer works - it seems the Entrez parsing code no
longer thinks what the NCBI returns is list-like, and so
Entrez.parse rejects it, saying to use Entrez.read to load
everything at once instead.

The same code works perfectly with our Tests/Entrez/pubmed2.xml
example file (also two PubMed articles), and at first glance
the XML structure is the same (other than the DTD update).

The top level XML tag's DTD has changed slightly:

<!ELEMENT PubmedArticleSet (PubmedArticle | PubmedBookArticle)+>

Now with pubmed_170101.dtd it can also end with an optional DeleteCitation element:

<!ELEMENT PubmedArticleSet ((PubmedArticle | PubmedBookArticle)+,
DeleteCitation?) >

I remain puzzled about what exactly has changed here.

Peter
