[Biopython] need help! how to retrieve full text from Pubmed central ?

Tue Jan 5 12:17:33 UTC 2010

There are multiple issues here.

First, the CorruptedXMLError is caused by a corrupted DTD, which I have now fixed on GitHub. The corrupted DTD inserts some gibberish into the XML, which is then no longer valid. If you replace the corrupted DTD with the correct one, the CorruptedXMLError goes away. Second, you will then find that a number of other DTDs are missing; these have now been uploaded to GitHub. Third, with the complete set of DTDs you run into a new error: one of the tags in the XML file is not declared in any of the DTDs. This is probably why the XML validators report that it is not valid XML. I have notified NCBI that the XML output is not consistent with the DTDs in this case.
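
To make the last point concrete, here is a minimal sketch (not how Bio.Entrez works internally) of checking whether every tag used in an XML document is declared somewhere in a set of DTDs. It uses only the Python standard library. The helper names declared_elements and undeclared_tags, and the toy DTD and XML strings, are made up for illustration, and the regular expression only handles plain <!ELEMENT name ...> declarations, not names built from parameter entities.

```python
import re
import xml.etree.ElementTree as ET

# Matches simple declarations such as <!ELEMENT article (front, body)>.
# Declarations whose name is a parameter entity (%name;) are skipped.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([^\s>()%]+)")


def declared_elements(dtd_texts):
    """Collect element names declared across several DTD strings."""
    names = set()
    for text in dtd_texts:
        names.update(ELEMENT_DECL.findall(text))
    return names


def undeclared_tags(xml_text, dtd_texts):
    """Return the sorted tags used in xml_text that no DTD declares."""
    declared = declared_elements(dtd_texts)
    root = ET.fromstring(xml_text)
    used = {element.tag for element in root.iter()}
    return sorted(used - declared)


# Toy data, invented for this example (not the real NCBI DTDs or output).
toy_dtds = [
    "<!ELEMENT article (front, body)>\n<!ELEMENT front (#PCDATA)>",
    "<!ELEMENT body (#PCDATA)>",
]
toy_xml = "<article><front>t</front><body>b</body><extra>?</extra></article>"

print(undeclared_tags(toy_xml, toy_dtds))
```

For the toy input this prints ['extra'], the one tag that none of the toy DTDs declares. A real check would have to read the actual DTD files and resolve their entities, which this sketch does not attempt.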

--Michiel

--- On Tue, 1/5/10, Peter <biopython at maubp.freeserve.co.uk> wrote:

> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: Re: [Biopython] need help! how to retrieve full text from Pubmed  central ?
> To: "Michiel de Hoon" <mjldehoon at yahoo.com>
> Cc: biopython at lists.open-bio.org, "Brad Chapman" <chapmanb at 50mail.com>
> Date: Tuesday, January 5, 2010, 6:46 AM
> On Mon, Jan 4, 2010 at 3:15 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> >
> > This *is* supported by Biopython. In principle, Bio.Entrez can parse any
> > XML generated by NCBI Entrez as long as the corresponding DTDs are
> > available. In this case, the DTD included in Biopython 1.53 is corrupted,
> > causing the error. Unfortunately, the correct DTD relies on a large number
> > of other DTDs, so just replacing the one DTD is not sufficient.
> >
> > Hmm... maybe we should think of a more robust way of getting the DTDs
> > without relying on their inclusion in the Biopython distribution ...
> 
> Which DTD has a problem? I was aware an elink DTD was *missing* in
> Biopython 1.53 (adding in git), but not of any corrupted DTD files.
> 
> In this particular example, it is the NCBI that have a problem - they are
> returning invalid XML which (understandably) our parser is rejecting.
> It could just be they haven't kept the XML output and the public DTD
> files in sync.
> 
> For example, consider this Entrez URL:
> 
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=2747014&retmode=xml
> 
> According to both these validators this is not a valid XML file!
> 
> http://www.validome.org/xml/validate/
> http://validator.w3.org/
> 
> In Biopython when we try and parse this exact URL:
> 
> >>> from Bio import Entrez
> >>> import urllib
> >>> record = Entrez.read(urllib.urlopen("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=2747014&retmode=xml"))
> Traceback (most recent call last):
> ...
> Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the XML data.
> Please make sure that the input data are in XML format, and that the
> data are not corrupted.
> 
> You get the same error using the Bio.Entrez.efetch function, which
> will use an equivalent URL (but with the tool and email set):
> 
> >>> from Bio import Entrez
> >>> Entrez.email = "your.name.here@example.com"
> >>> record = Entrez.read(Entrez.efetch(db="pmc", id="2747014", retmode="xml"))
> Traceback (most recent call last):
> ...
> Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the XML data.
> Please make sure that the input data are in XML format, and that the
> data are not corrupted.
> 
> Peter
>