[Biopython] NCBI e-utils parser upgrade

Mon Nov 24 16:16:51 UTC 2014

Michiel, Peter,

Thanks for the feedback. Updating startNamespaceDeclHandler seems to be
the logical way to go. I don't have much experience with XML schemas, but I
will give it a try and make a pull request if I get something decent
working.
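
For reference, here is a minimal sketch of the hook in question, using only
the standard-library xml.parsers.expat module. It is not the Bio.Entrez code:
the NamespaceSniffer class, the sample XML and the example.org schema URL are
all made up for illustration. It only shows that expat calls
StartNamespaceDeclHandler for each xmlns declaration (when namespace
processing is enabled), and that the schema location can then be read from
the xsi attributes of the root element. Parsing the schema itself, which is
the real work, is not attempted here.

```python
import xml.parsers.expat

# Namespace URI of XML Schema instance attributes (standard W3C value).
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical sample: a root element that points to a schema, not a DTD.
SAMPLE = b"""<?xml version="1.0"?>
<Report xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="http://example.org/report.xsd">
  <Item>example</Item>
</Report>"""


class NamespaceSniffer:
    """Toy handler class (not Bio.Entrez) that records namespace info."""

    def __init__(self):
        self.namespaces = {}         # prefix -> namespace URI
        self.schema_location = None  # first schema location seen, if any
        # Namespace processing must be on for the namespace handler to fire.
        self.parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
        self.parser.StartNamespaceDeclHandler = self.start_namespace_decl
        self.parser.StartElementHandler = self.start_element

    def start_namespace_decl(self, prefix, uri):
        # Called once per xmlns declaration, before the element itself.
        self.namespaces[prefix] = uri

    def start_element(self, name, attrs):
        # With a namespace separator set, prefixed attribute names arrive
        # expanded as "<namespace URI> <local name>".
        for local in ("noNamespaceSchemaLocation", "schemaLocation"):
            key = XSI + " " + local
            if key in attrs and self.schema_location is None:
                self.schema_location = attrs[key]

    def parse(self, data):
        self.parser.Parse(data, True)


sniffer = NamespaceSniffer()
sniffer.parse(SAMPLE)
print(sniffer.namespaces)
print(sniffer.schema_location)
```

A real implementation would presumably fetch and parse the schema at that
location and fill in the same element-type tables that the DTD parser
populates today.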

Ivan

On Thu, Nov 20, 2014 at 8:20 PM, Michiel de Hoon <mjldehoon at yahoo.com>
wrote:

> Hi Ivan,
>
> I am the original author of Bio.Entrez.
> The parser in Bio.Entrez consists of two parts: The XML parser and the DTD
> parser.
> The DTD parser is used to determine how the elements in the XML file
> should be represented in Python.
> To allow schemas, all that is needed is to write a parser for the schema;
> the XML parser is unchanged.
> In Bio/Entrez/Parser.py, you will find the method
> startNamespaceDeclHandler;
> currently it just raises a NotImplementedError.
> If you try the Bio.Entrez parser on your XML file, you will see that this
> error gets raised.
> So all you would have to do is to implement startNamespaceDeclHandler;
> it should parallel externalEntityRefHandler, which parses DTD files,
> though the bulk of the work is done in elementDecl.
> Please let me know if you run into any problems.
>
> Best,
> -Michiel.
>
>
>
>
> --------------------------------------------
> On Fri, 11/21/14, Ivan Erill <ivan.erill at gmail.com> wrote:
>
>  Subject: [Biopython] NCBI e-utils parser upgrade
>  To: biopython at mailman.open-bio.org
>  Date: Friday, November 21, 2014, 2:42 AM
>
>  Hi all,
>  As part of my
>  work, I need to deal with the new WP protein records at NCBI
>  and, specifically, with the information on their coding
>  sequences. This information is returned by E-utils through
>  an integrated protein report type of view:
>
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=231025&rettype=ipg
>
>  which does not use
>  a DTD for the XML, but rather a schema. Although there has
>  been no formal announcement, I've been talking to NCBI
>  people and they tell me that they will progressively be
>  moving to schemas (which provide more fine-grained
>  validation specification). Specifically, all new XML exports
>  from NCBI will be using schemas. I don't believe that
>  existing DTDs are going to be replaced by schemas for
>  now.
>  My original
>  thought was to branch an update for the current XML parser
>  in Biopython, but it looks like using schemas would be a
>  major overhaul of the existing code-base and it might make
>  more sense to develop a parallel parser, so I first wanted
>  to check which approach you guys would prefer
>  code-wise.
>  Regards,
>  Ivan