[Biopython] PubmedCentral XML parsing

Peter Cock p.j.a.cock at googlemail.com
Mon Apr 29 11:23:16 UTC 2013


On Thu, Apr 25, 2013 at 8:16 PM, Paulo Nuin <nuin at genedrift.org> wrote:
> Hi Peter
>
> Thanks a lot. I am getting an error when trying to parse with
> Entrez.parse. I download the nxml file prior to parsing, using PMC's FTP
> server in order to avoid their bulk downloading restrictions. Anyway, the
> code I am using is quite simple (with ipython):
>
> In [1]: from Bio import Entrez
>
> In [2]: handle = open('nihms83342.nxml')
>
> In [3]: records = Entrez.parse(handle)
>
> In [4]: for i in records:
>    ...:     print i
>    ...:
>
> ---------------------------------------------------------------------------
> NotXMLError                               Traceback (most recent call
> last)
> <ipython-input-4-82461854c9e7> in <module>()
> ----> 1 for i in records:
>       2     print i
>       3
>
> /Library/Python/2.7/site-packages/Bio/Entrez/Parser.pyc in parse(self,
> handle)
>     229                         # We did not see the initial <!xml
> declaration, so
>     230                         # probably the input data is not in XML
> format.
> --> 231                         raise NotXMLError("XML declaration not
> found")
>     232                 self.parser.Parse("", True)
>     233                 self.parser = None
>
> NotXMLError: Failed to parse the XML data (XML declaration not found).
> Please make sure that the input data are in XML format.
>
> And the file header is
>
> <?xml version="1.0"?>
> <!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange
> DTD v2.3 20070202//EN" "archivearticle.dtd">
> <article xmlns:xlink="http://www.w3.org/1999/xlink"
> xmlns:mml="http://www.w3.org/1998/Math/MathML"
> article-type="research-article" xml:lang="EN">
>         <?properties open_access?>
>         <?properties manuscript?>
>         <front>
>                 <journal-meta>
>
> Is there a different way of parsing this file?
>
> Thanks in advance
>
> Paulo

Hi Paulo,

The header you've shown here does not match the file you
attached to the bug report (where the first line is missing
and there seem to be no line breaks either):
https://redmine.open-bio.org/issues/3430

Where exactly did the nihms83342.nxml file come from?
Is there a URL we can download it from to check?
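
In the meantime, a quick way to compare the two copies might be
to look at the first bytes of each file directly. The sketch below
is only an illustration: describe_header is a hypothetical helper
(it is not part of Biopython), and the 200-byte peek size is an
arbitrary assumption. It reports whether the file starts with an
XML declaration and how many line breaks appear in the first chunk.

```python
def describe_header(path, peek=200):
    """Inspect the start of a file (hypothetical helper, not Biopython API).

    Returns a tuple (has_declaration, newline_count, head) where head
    is the first `peek` bytes of the file, read in binary mode.
    """
    with open(path, "rb") as handle:
        head = handle.read(peek)
    # Ignore a UTF-8 byte order mark, if present, before checking.
    body = head[3:] if head.startswith(b"\xef\xbb\xbf") else head
    has_declaration = body.startswith(b"<?xml")
    newline_count = head.count(b"\n")
    return has_declaration, newline_count, head
```

Running this on both the local nihms83342.nxml and the file attached
to the bug report, and printing the returned head, should show
whether the first line really differs between them.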

Thanks,

Peter


