[Biopython] Entrez.parse error

Wed Dec 21 14:13:35 UTC 2016

OK, here is a partial solution based on Entrez.read:

from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
pmid = '12345'
handle = Entrez.efetch("pubmed", id=str(pmid), retmode="xml")
records = Entrez.read(handle)
print(records['PubmedArticle'][0]['MedlineCitation']['Article']['ArticleTitle'])
A new granulation method for compressed tablets [proceedings].

The title instead of an error message!

So it appears that NCBI has tweaked the data structure. Entrez.parse is still broken and I haven’t quite figured out how to fix it.
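
For several PMIDs at once, the same lookup can be wrapped in a small generator. This is only a sketch, assuming the Entrez.read result keeps the layout seen above (a top-level 'PubmedArticle' list whose items hold 'MedlineCitation'). The helper name iter_article_titles is made up for illustration and is not part of Biopython, and the sample dict is a hand-written stand-in so the snippet runs without contacting NCBI.

```python
def iter_article_titles(records):
    """Yield (pmid, title) pairs from an Entrez.read-style result.

    Assumption: 'records' is dict-like and its 'PubmedArticle' key holds a
    list of citations, as observed in this thread. Hypothetical helper, not
    a Biopython function.
    """
    for article in records.get("PubmedArticle", []):
        citation = article["MedlineCitation"]
        # str() turns Biopython's string-like elements into plain strings
        yield str(citation["PMID"]), str(citation["Article"]["ArticleTitle"])


# Hand-written stand-in for what Entrez.read(handle) returned above.
sample = {
    "PubmedArticle": [
        {
            "MedlineCitation": {
                "PMID": "12345",
                "Article": {
                    "ArticleTitle": "A new granulation method for "
                                    "compressed tablets [proceedings]."
                },
            }
        }
    ],
    "PubmedBookArticle": [],
}

for pmid, title in iter_article_titles(sample):
    print("%s: %s" % (pmid, title))
```

With a live query, the value returned by Entrez.read(handle) would be passed in place of sample. This gives a loop similar to the old Entrez.parse usage, although Entrez.read loads the whole result into memory rather than streaming it.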

Konrad

> On 21 Dec 2016, at 09:27, Konrad Koehler <konrad.koehler at mac.com> wrote:
> 
> Entrez.parse was written for a reason: to parse complex XML data so that it is easy to extract citation data from it. Entrez.read does indeed work, but its output is such a complex data structure that parsing it is a non-trivial exercise.
> 
> Entrez.parse was working for a very long time, but is no longer working.  Try the following example from Biopython documentation <http://biopython.org/DIST/docs/api/Bio.Entrez-module.html>:
> 
> from Bio import Entrez
> Entrez.email = "Your.Name.Here@example.org"
> handle = Entrez.efetch("pubmed", id="19304878,14630660", retmode="xml")
> records = Entrez.parse(handle)
> for record in records:
>     print(record['MedlineCitation']['Article']['ArticleTitle'])
> 
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/user/anaconda/lib/python2.7/site-packages/Bio/Entrez/Parser.py", line 302, in parse
>     raise ValueError("The XML file does not represent a list. Please use Entrez.read instead of Entrez.parse")
> ValueError: The XML file does not represent a list. Please use Entrez.read instead of Entrez.parse
> 
> I have reproduced this error on Mac OS X and also a Linux machine.  Peter has also reproduced the problem.
> 
> Can you rewrite the above example so that it works with Entrez.read to print out the “ArticleTitle” data?  A better solution of course is to fix Entrez.parse.  I have tried myself to fix this problem, but I am stumped.
> 
> Konrad
> 
> 
>> On 21 Dec 2016, at 08:47, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>> 
>> In what sense is the current result from Entrez.read more difficult to parse than the previous result from Entrez.parse?
>> As far as I can tell, Entrez.read and Entrez.parse are both working correctly.
>> Best,
>> -Michiel
>> 
>> 
>> On Tuesday, December 20, 2016 1:43 PM, Konrad Koehler <konrad.koehler at mac.com> wrote:
>> 
>> 
>> Then how does one parse the output? Entrez.parse used to work, but no longer does. Apparently NCBI has made changes to their XML that have broken Entrez.parse. Entrez.read returns a complex data structure that is difficult to parse.
>> If one adds "['PubmedArticle']" to line 302 of /Bio/Entrez/Parser.py so that it reads:
>> records = self.stack[0]['PubmedArticle']
>> this suppresses the error message, but it mysteriously returns only the strings "PubmedArticle" and "PubmedBookArticle" and not the citation. Any ideas?
>> 
>> Konrad
>> 
>>> On 20 Dec 2016, at 05:16, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>>> 
>>> Entrez.read works for me for the example shown.
>>> 
>>> Best,
>>> -Michiel
>>> 
>>> 
>>> On Sunday, December 18, 2016 11:57 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>> 
>>> 
>>> On Sun, Dec 18, 2016 at 2:50 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>> > On Thu, Dec 15, 2016 at 7:37 PM, Konrad Koehler <konrad.koehler at mac.com> wrote:
>>> >> Hello everyone,
>>> >>
>>> >> I have been using Entrez.parse for years without any errors.  However just
>>> >> in the last day or two, it stopped working.  I have been able to reproduce
>>> >> the error using the following example from the biopython Package Entrez
>>> >> documentation:
>>> >>
>>> >
>>> > I can reproduce this. The XML looks sensible, two <PubmedArticle>
>>> > tags:
>>> >
>>> > <?xml version="1.0" ?>
>>> > <!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st
>>> > January 2017//EN"
>>> > "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_170101.dtd">
>>> > <PubmedArticleSet>
>>> > <PubmedArticle>
>>> >    <MedlineCitation Status="MEDLINE" Owner="NLM">
>>> >        <PMID Version="1">19304878</PMID>
>>> >        ...
>>> >    </MedlineCitation>
>>> >    <PubmedData>
>>> >        ...
>>> >    </PubmedData>
>>> > </PubmedArticle>
>>> > <PubmedArticle>
>>> >    <MedlineCitation Status="MEDLINE" Owner="NLM">
>>> >        <PMID Version="1">14630660</PMID>
>>> >        ...
>>> >    </MedlineCitation>
>>> >    <PubmedData>
>>> >        ...
>>> >    </PubmedData>
>>> > </PubmedArticle>
>>> > </PubmedArticleSet>
>>> >
>>> > Note however it is using a new DTD file for Jan 2017,
>>> >
>>> > https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_170101.dtd
>>> >
>>> >
>>> >> Does anyone have any suggestions on how to get Entrez.parse working again? I
>>> >> am also curious why this stopped working.  Has the NCBI server changed?
>>> >>
>>> >
>>> > I would guess that the NCBI changed something subtly. Michiel?
>>> >
>>> > Peter
>>> 
>>> Logged on GitHub,
>>> 
>>> https://github.com/biopython/biopython/issues/1027
>>> 
>>> 
>>> Peter
>>> 
>>> 
>> 
>> 
>> 
> 
> _______________________________________________
> Biopython mailing list  -  Biopython at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython
