[Biopython] PubmedCentral XML parsing
Peter Cock
p.j.a.cock at googlemail.com
Mon Apr 29 11:23:16 UTC 2013
On Thu, Apr 25, 2013 at 8:16 PM, Paulo Nuin <nuin at genedrift.org> wrote:
> Hi Peter
>
> Thanks a lot. I am getting an error when trying to parse with
> Entrez.parse. I download the nxml file prior to parsing, using PMC's FTP
> server in order to avoid their bulk downloading restrictions. Anyway, the
> code I am using is quite simple (with ipython):
>
> In [1]: from Bio import Entrez
>
> In [2]: handle = open('nihms83342.nxml')
>
> In [3]: records = Entrez.parse(handle)
>
> In [4]: for i in records:
> ...:     print i
> ...:
>
> ---------------------------------------------------------------------------
> NotXMLError                               Traceback (most recent call last)
> <ipython-input-4-82461854c9e7> in <module>()
> ----> 1 for i in records:
>       2     print i
>       3
>
> /Library/Python/2.7/site-packages/Bio/Entrez/Parser.pyc in parse(self, handle)
>     229             # We did not see the initial <!xml declaration, so
>     230             # probably the input data is not in XML format.
> --> 231             raise NotXMLError("XML declaration not found")
>     232         self.parser.Parse("", True)
>     233         self.parser = None
>
> NotXMLError: Failed to parse the XML data (XML declaration not found).
> Please make sure that the input data are in XML format.
>
> And the file header is
>
> <?xml version="1.0"?>
> <!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange
> DTD v2.3 20070202//EN" "archivearticle.dtd">
> <article xmlns:xlink="http://www.w3.org/1999/xlink"
> xmlns:mml="http://www.w3.org/1998/Math/MathML"
> article-type="research-article" xml:lang="EN">
> <?properties open_access?>
> <?properties manuscript?>
> <front>
> <journal-meta>
>
> Is there a different way of parsing this file?
>
> Thanks in advance
>
> Paulo
Hi Paulo,
The header you've shown here does not match the file you
attached to the bug report (where the first line is missing
and there seem to be no line breaks either):
https://redmine.open-bio.org/issues/3430
Where exactly did the nihms83342.nxml file come from?
Is there a URL we can download it from to check?
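In case it helps to compare the two copies of the file, here is a
minimal sketch that reports whether the data starts with the XML
declaration and how many line breaks are in the first few hundred
bytes. The helper names (summarize_header, summarize_file) and the
300-byte default are my own assumptions for illustration; they are
not part of Biopython.

```python
def summarize_header(head):
    """Describe the leading bytes of a file.

    `head` is a bytes object holding the start of the file. The check is
    deliberately strict: leading whitespace or a byte-order mark before
    "<?xml" is reported as a missing declaration.
    """
    return {
        "has_xml_declaration": head.startswith(b"<?xml"),
        "line_breaks": head.count(b"\n"),
    }


def summarize_file(path, size=300):
    """Read the first `size` bytes of `path` in binary mode and summarize them."""
    with open(path, "rb") as handle:
        return summarize_header(handle.read(size))
```

Calling summarize_file('nihms83342.nxml') on both the FTP download
and the copy attached to the bug report should show quickly whether
they really differ in the first line and in the line breaks.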
Thanks,
Peter