[Biopython] need help! how to retrieve full text from Pubmed central ?

Peter biopython at maubp.freeserve.co.uk
Tue Jan 5 11:46:34 UTC 2010


On Mon, Jan 4, 2010 at 3:15 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> This *is* supported by Biopython. In principle, Bio.Entrez can parse any
> XML generated by NCBI Entrez as long as the corresponding DTDs are
> available. In this case, the DTD included in Biopython 1.53 is corrupted,
> causing the error. Unfortunately, the correct DTD relies on a large number
> of other DTDs, so just replacing the one DTD is not sufficient.
>
> Hmm... maybe we should think of a more robust way of getting the DTDs
> without relying on their inclusion in the Biopython distribution ...

Which DTD has a problem? I was aware that an elink DTD was *missing* from
Biopython 1.53 (being added in git), but not of any corrupted DTD files.

In this particular example, the problem is on the NCBI side: they are
returning invalid XML, which (understandably) our parser rejects.
It may simply be that they haven't kept the XML output and the public
DTD files in sync.

For example, consider this Entrez URL:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=2747014&retmode=xml

According to both of these validators, this is not a valid XML file!

http://www.validome.org/xml/validate/
http://validator.w3.org/
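
As a rough local counterpart to the online validators, the sketch below
checks only whether a string is well-formed XML, using the Python 3
standard library. It does not validate against a DTD (the standard
library parser cannot do that), so it cannot reproduce the validators'
verdict; the helper name is_well_formed and the two sample strings are
made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML.

    This checks well-formedness only; no DTD validation is performed.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Illustrative inputs, not real NCBI output.
good = "<article><title>Example</title></article>"
bad = "<article><title>Example</article>"  # mismatched closing tag

print(is_well_formed(good))
print(is_well_formed(bad))
```

A document can pass this check and still be invalid against its DTD,
which is the kind of mismatch suspected here.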

In Biopython, when we try to parse this exact URL:

>>> from Bio import Entrez
>>> import urllib
>>> record = Entrez.read(urllib.urlopen("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=2747014&retmode=xml"))
Traceback (most recent call last):
...
Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the XML data.
Please make sure that the input data are in XML format, and that the
data are not corrupted.

You get the same error using the Bio.Entrez.efetch function, which
uses an equivalent URL (but with the tool and email set):

>>> from Bio import Entrez
>>> Entrez.email = "your.name.here@example.com"
>>> record = Entrez.read(Entrez.efetch(db="pmc", id="2747014", retmode="xml"))
Traceback (most recent call last):
...
Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the XML data.
Please make sure that the input data are in XML format, and that the
data are not corrupted.
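
To make "an equivalent URL" concrete, here is a minimal Python 3 sketch
of how such a URL could be assembled by hand. It is not Biopython's
actual implementation; build_efetch_url is a hypothetical helper, and
the tool value "biopython" is assumed to match the library default.
Only the db, id, retmode, tool and email query parameters are used.

```python
from urllib.parse import urlencode

EFETCH_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(db, record_id, retmode, tool=None, email=None):
    """Build an efetch URL; tool and email are appended only if given."""
    params = [("db", db), ("id", record_id), ("retmode", retmode)]
    if tool:
        params.append(("tool", tool))
    if email:
        params.append(("email", email))
    # urlencode percent-escapes reserved characters such as "@".
    return EFETCH_BASE + "?" + urlencode(params)


# Same query as the hand-written URL earlier in this message.
plain = build_efetch_url("pmc", "2747014", "xml")

# Assumed equivalent of what Entrez.efetch sends (illustrative values).
tagged = build_efetch_url(
    "pmc", "2747014", "xml",
    tool="biopython", email="your.name.here@example.com",
)

print(plain)
print(tagged)
```

Both URLs request the same record, so the same XML (and the same parse
error) would be expected from either.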

Peter


