[Biopython] Entrez.parse error
Konrad Koehler
konrad.koehler at mac.com
Wed Dec 21 14:13:35 UTC 2016
OK, here is a partial solution based on Entrez.read:
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
pmid = '12345'
handle = Entrez.efetch("pubmed", id=str(pmid), retmode="xml")
records = Entrez.read(handle)
print(records['PubmedArticle'][0]['MedlineCitation']['Article']['ArticleTitle'])
A new granulation method for compressed tablets [proceedings].
The title instead of an error message!
So it appears that NCBI has tweaked the data structure. Entrez.parse is still broken and I haven’t quite figured out how to fix it.
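In the meantime, the per-record loop that Entrez.parse used to provide can be approximated on top of the Entrez.read result. The following is only a sketch: the helper names iter_pubmed_articles and article_title are made up for illustration, and it assumes the result nests titles exactly as in the example above (a 'PubmedArticle' list whose items contain ['MedlineCitation']['Article']['ArticleTitle']). A plain dict stands in for the real Entrez.read output so the snippet runs without network access.

```python
def iter_pubmed_articles(records):
    """Yield each article from an Entrez.read()-style result.

    Assumes the top-level mapping has a 'PubmedArticle' list, as in the
    example above; a missing key is treated as an empty list.
    """
    for article in records.get("PubmedArticle", []):
        yield article


def article_title(article):
    """Return the article title as a plain string (hypothetical helper)."""
    return str(article["MedlineCitation"]["Article"]["ArticleTitle"])


# Stand-in data with the same nesting as the Entrez.read result above.
# With Biopython, this would instead be: records = Entrez.read(handle)
sample = {
    "PubmedArticle": [
        {"MedlineCitation": {"Article": {"ArticleTitle": "First title"}}},
        {"MedlineCitation": {"Article": {"ArticleTitle": "Second title"}}},
    ],
    "PubmedBookArticle": [],
}

titles = [article_title(a) for a in iter_pubmed_articles(sample)]
for title in titles:
    print(title)
```

Unlike Entrez.parse, this still reads the whole response into memory before iterating, so it is a stopgap for small ID lists rather than a replacement.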
Konrad
> On 21 Dec 2016, at 09:27, Konrad Koehler <konrad.koehler at mac.com> wrote:
>
> Entrez.parse was written for a reason: to parse complex XML data so that it is easy to extract citation data from it. Entrez.read does indeed work, but its output is such a complex data structure that parsing it is a non-trivial exercise.
>
> Entrez.parse was working for a very long time, but is no longer working. Try the following example from Biopython documentation <http://biopython.org/DIST/docs/api/Bio.Entrez-module.html>:
>
> from Bio import Entrez
> Entrez.email = "Your.Name.Here@example.org"
> handle = Entrez.efetch("pubmed", id="19304878,14630660", retmode="xml")
> records = Entrez.parse(handle)
> for record in records:
>     print(record['MedlineCitation']['Article']['ArticleTitle'])
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/user/anaconda/lib/python2.7/site-packages/Bio/Entrez/Parser.py", line 302, in parse
>     raise ValueError("The XML file does not represent a list. Please use Entrez.read instead of Entrez.parse")
> ValueError: The XML file does not represent a list. Please use Entrez.read instead of Entrez.parse
>
> I have reproduced this error on Mac OS X and also a Linux machine. Peter has also reproduced the problem.
>
> Can you rewrite the above example so that it works with Entrez.read to print out the “ArticleTitle” data? A better solution of course is to fix Entrez.parse. I have tried myself to fix this problem, but I am stumped.
>
> Konrad
>
>
>> On 21 Dec 2016, at 08:47, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>>
>> In what sense is the current result from Entrez.read more difficult to parse than the previous result from Entrez.parse?
>> As far as I can tell, Entrez.read and Entrez.parse are both working correctly.
>> Best,
>> -Michiel
>>
>>
>> On Tuesday, December 20, 2016 1:43 PM, Konrad Koehler <konrad.koehler at mac.com> wrote:
>>
>>
>> Then how does one parse the output? Entrez.parse used to work, but no longer does. Apparently NCBI has made changes to their XML that have broken Entrez.parse. Entrez.read returns a complex data structure that is difficult to parse.
>> If one adds "['PubmedArticle']" to line 302 of /Bio/Entrez/Parser.py so that it reads:
>> records = self.stack[0]['PubmedArticle']
>> this suppresses the error message, but it mysteriously returns only the strings "PubmedArticle" and "PubmedBookArticle" and not the citation. Any ideas?
>>
>> Konrad
>>
>>> On 20 Dec 2016, at 05:16, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>>>
>>> Entrez.read works for me for the example shown.
>>>
>>> Best,
>>> -Michiel
>>>
>>>
>>> On Sunday, December 18, 2016 11:57 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>>
>>>
>>> On Sun, Dec 18, 2016 at 2:50 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>> > On Thu, Dec 15, 2016 at 7:37 PM, Konrad Koehler <konrad.koehler at mac.com> wrote:
>>> >> Hello everyone,
>>> >>
>>> >> I have been using Entrez.parse for years without any errors. However just
>>> >> in the last day or two, it stopped working. I have been able to reproduce
>>> >> the error using the following example from the biopython Package Entrez
>>> >> documentation:
>>> >>
>>> >
>>> > I can reproduce this. The XML looks sensible, two <PubmedArticle>
>>> > tags:
>>> >
>>> > <?xml version="1.0" ?>
>>> > <!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st
>>> > January 2017//EN"
>>> > "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_170101.dtd">
>>> > <PubmedArticleSet>
>>> > <PubmedArticle>
>>> > <MedlineCitation Status="MEDLINE" Owner="NLM">
>>> > <PMID Version="1">19304878</PMID>
>>> > ...
>>> > </MedlineCitation>
>>> > <PubmedData>
>>> > ...
>>> > </PubmedData>
>>> > </PubmedArticle>
>>> > <PubmedArticle>
>>> > <MedlineCitation Status="MEDLINE" Owner="NLM">
>>> > <PMID Version="1">14630660</PMID>
>>> > ...
>>> > </MedlineCitation>
>>> > <PubmedData>
>>> > ...
>>> > </PubmedData>
>>> > </PubmedArticle>
>>> > </PubmedArticleSet>
>>> >
>>> > Note however it is using a new DTD file for Jan 2017,
>>> >
>>> > https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_170101.dtd
>>> >
>>> >
>>> >> Does anyone have any suggestions on how to get Entrez.parse working again? I
>>> >> am also curious why this stopped working. Has the NCBI server changed?
>>> >>
>>> >
>>> > I would guess that the NCBI changed something subtly. Michiel?
>>> >
>>> > Peter
>>>
>>> Logged on GitHub,
>>>
>>> https://github.com/biopython/biopython/issues/1027
>>>
>>>
>>> Peter
>>>
>>>
>>
>>
>>
>
> _______________________________________________
> Biopython mailing list - Biopython at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython