[Biopython] need help! how to retrieve full text from PubMed Central?

Tue Jan 5 07:42:10 EST 2010

On Tue, Jan 5, 2010 at 12:17 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> There are multiple issues here.
>
> First, the CorruptedXMLError is caused by the corrupted DTD (which I have
> now fixed on github). Basically, the corrupted DTD inserts some gibberish into
> the XML, which is then no longer valid. If you replace the corrupted DTD by
> the correct one, the CorruptedXMLError goes away.

I see what you mean: our old copy of nlm-articleset-2.0.dtd was actually an
HTML redirect message. Oops. Thanks for sorting out that glitch - my fault.
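
For anyone who wants to check their own local DTD copies for the same kind
of glitch, here is a minimal sketch. It assumes that a bad download shows up
as a file starting with an HTML doctype or an <html> tag. The helper names
(looks_like_html, find_suspect_dtds) are hypothetical and not part of
Biopython, and the directory you pass in depends on your installation.

```python
import re
from pathlib import Path

# Hypothetical helpers for illustration; they are not part of Biopython.
_HTML_START = re.compile(r"\s*(<!DOCTYPE\s+html|<html)", re.IGNORECASE)


def looks_like_html(text):
    """Return True if the text begins like an HTML page instead of a DTD."""
    return _HTML_START.match(text) is not None


def find_suspect_dtds(directory):
    """List the names of *.dtd files in directory that begin like HTML."""
    suspects = []
    for path in sorted(Path(directory).glob("*.dtd")):
        # Only the beginning of the file is needed for this check.
        with open(path, encoding="utf-8", errors="replace") as handle:
            head = handle.read(1024)
        if looks_like_html(head):
            suspects.append(path.name)
    return suspects
```

This only catches the HTML-page case. A DTD that is truncated or damaged in
some other way would still need a real XML validator.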

> But you'll find that a bunch of other DTDs are missing (these have now been
> uploaded to github). With the complete set of DTDs, you run into a new error:

Do you get this error?
NotImplementedError: The Bio.Entrez parser cannot handle XML data that
make use of XML namespaces
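
If it helps to confirm whether a given XML document declares namespaces
before handing it to Bio.Entrez, a rough check with the standard library
might look like the sketch below. The function name declared_namespaces is
hypothetical. It also assumes the document parses without its external DTD
being loaded, which may not hold for records that rely on DTD-defined
entities.

```python
import io
import xml.etree.ElementTree as ET


def declared_namespaces(xml_bytes):
    """Return the (prefix, uri) pairs declared anywhere in the document."""
    # The "start-ns" event yields a (prefix, uri) tuple for each declaration.
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",))
    return [namespace for _event, namespace in events]
```

An empty list means no namespace declarations were seen. Anything else names
the prefixes and URIs that the document uses.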

> One of the tags in the XML file is not listed anywhere in any of the DTDs.
> This is probably the reason the XML validators show that it's not valid XML.
> I've notified NCBI that the XML output is not consistent with the DTDs for
> this case.

Excellent - thank you.

Peter

P.S. Last year (Sept 2009) I reported a similar problem with ELink XML
failing to validate when the history was used (while working on the
"Searching for citations" example in the tutorial). That seems to be
resolved now so I can update the tutorial...