[Biopython] NCBI e-utils parser upgrade
Michiel de Hoon
mjldehoon at yahoo.com
Fri Nov 21 01:20:14 UTC 2014
Hi Ivan,
I am the original author of Bio.Entrez.
The parser in Bio.Entrez consists of two parts: the XML parser and the DTD parser.
The DTD parser is used to determine how the elements in the XML file should be represented in Python.
To allow schemas, all that is needed is to write a parser for the schema; the XML parser is unchanged.
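As a rough illustration of the DTD side (a sketch only, not the actual Bio.Entrez code): expat reports each <!ELEMENT ...> declaration to an ElementDeclHandler, and the content model can be used to pick a Python representation. The function name classify_elements, the SAMPLE document and the "string"/"container" labels are made up for this example.

```python
from xml.parsers import expat


def classify_elements(xml_bytes):
    """Map each element declared in the DTD to a (hypothetical) Python kind."""
    kinds = {}

    def element_decl(name, model):
        # model is a tuple: (content type, quantifier, name, children)
        content_type = model[0]
        if content_type in (expat.model.XML_CTYPE_MIXED,
                            expat.model.XML_CTYPE_EMPTY):
            # Text-only or empty elements: treat as plain strings.
            # (Simplification: mixed content with child elements also lands here.)
            kinds[name] = "string"
        else:
            # Elements with child elements: treat as containers.
            kinds[name] = "container"

    parser = expat.ParserCreate()
    parser.ElementDeclHandler = element_decl
    parser.Parse(xml_bytes, True)
    return kinds


# Hypothetical document with an internal DTD subset.
SAMPLE = b"""<?xml version="1.0"?>
<!DOCTYPE Record [
<!ELEMENT Record (Id, Name)>
<!ELEMENT Id (#PCDATA)>
<!ELEMENT Name (#PCDATA)>
]>
<Record><Id>1</Id><Name>example</Name></Record>"""

print(classify_elements(SAMPLE))
```

A schema parser would have to produce the same kind of per-element information from the .xsd file instead.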
In Bio/Entrez/Parser.py, you will find the method startNamespaceDeclHandler;
currently it just raises a NotImplementedError.
If you try the Bio.Entrez parser on your XML file, you will see that this error gets raised.
So all you would have to do is to implement startNamespaceDeclHandler;
it should parallel externalEntityRefHandler, which parses DTD files, though the bulk of the work is done in elementDecl.
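To give an idea of the hook involved, here is a minimal sketch using plain xml.parsers.expat rather than Bio.Entrez itself. It assumes the XML announces its schema through an xsi:noNamespaceSchemaLocation attribute; I have not checked that the NCBI output does this. The function name find_schema_hints, the SAMPLE document and the example.org URL are made up for the illustration.

```python
from xml.parsers import expat

XSI = "http://www.w3.org/2001/XMLSchema-instance"


def find_schema_hints(xml_bytes):
    """Collect namespace declarations and any schema location found."""
    namespaces = []
    schema_locations = []

    def start_namespace_decl(prefix, uri):
        # Called by expat for every xmlns declaration it encounters.
        namespaces.append((prefix, uri))

    def start_element(name, attrs):
        # With namespace processing on, attribute names are expanded to
        # "<namespace uri><separator><local name>".
        key = XSI + " noNamespaceSchemaLocation"
        if key in attrs:
            # This is the point where a schema parser could be started.
            schema_locations.append(attrs[key])

    # Namespace handlers only fire when a namespace separator is given.
    parser = expat.ParserCreate(namespace_separator=" ")
    parser.StartNamespaceDeclHandler = start_namespace_decl
    parser.StartElementHandler = start_element
    parser.Parse(xml_bytes, True)
    return namespaces, schema_locations


# Hypothetical document; the schema URL is a placeholder.
SAMPLE = (
    b'<Report xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    b'xsi:noNamespaceSchemaLocation="http://example.org/report.xsd">'
    b"<Item>1</Item></Report>"
)

print(find_schema_hints(SAMPLE))
```

In Bio.Entrez the equivalent step would then fetch and parse the schema, much as the DTD is fetched and parsed today.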
Please let me know if you run into any problems.
Best,
-Michiel.
--------------------------------------------
On Fri, 11/21/14, Ivan Erill <ivan.erill at gmail.com> wrote:
Subject: [Biopython] NCBI e-utils parser upgrade
To: biopython at mailman.open-bio.org
Date: Friday, November 21, 2014, 2:42 AM
Hi all,
As part of my work, I need to deal with the new WP protein records at NCBI and, specifically, with the information on their coding sequences.
This information is returned by E-utils through an integrated protein report type of view:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=231025&rettype=ipg
which does not use a DTD for the XML, but rather a schema.
Although there has been no formal announcement, I've been talking to NCBI people, and they tell me that they will progressively be moving to schemas (which provide a more fine-grained validation specification).
Specifically, all new XML exports from NCBI will be using schemas.
I don't believe that existing DTDs are going to be replaced by schemas for now.
My original thought was to branch an update for the current XML parser in Biopython, but it looks like using schemas would be a major overhaul of the existing code base, and it might make more sense to develop a parallel parser.
So I first wanted to check which approach you guys would prefer code-wise.
Regards,
Ivan