[Biopython] NCBI e-utils parser upgrade

Ivan Erill ivan.erill at gmail.com
Thu Nov 20 17:42:37 UTC 2014


Hi all,

As part of my work, I need to deal with the new WP protein records at NCBI
and, specifically, with the information on their coding sequences. This
information is returned by E-utils through an integrated protein report
type of view:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=231025&rettype=ipg

which does not use a DTD for the XML, but rather a schema. Although there
has been no formal announcement, I've been talking to NCBI people and they
tell me that they will progressively be moving to schemas (which allow
finer-grained validation rules). Specifically, all new XML exports from
NCBI will use schemas. I don't believe that existing DTDs are going to be
replaced by schemas for now.
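
To make the DTD/schema distinction concrete, here is a rough sketch of how
a parser could tell the two kinds of document apart before deciding how to
handle them. This is not existing Biopython code: the function name
detect_validation_style and the example.org URLs are made up for
illustration, and it assumes the schema is referenced from the root
element through the standard xsi:noNamespaceSchemaLocation or
xsi:schemaLocation attribute.

```python
import xml.parsers.expat

# Standard XML Schema instance namespace (W3C).
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def detect_validation_style(xml_bytes):
    """Return ("dtd", system_id), ("schema", location) or ("none", None).

    Hypothetical helper, for illustration only.
    """
    result = {"kind": "none", "ref": None, "root_seen": False}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # A DOCTYPE declaration means the document points at a DTD.
        result["kind"], result["ref"] = "dtd", system_id

    def on_start(name, attrs):
        # Only inspect the root element, and only if no DOCTYPE was seen.
        if result["root_seen"]:
            return
        result["root_seen"] = True
        if result["kind"] == "dtd":
            return
        # With namespace_separator=" ", expat reports names as "uri local".
        for local in ("noNamespaceSchemaLocation", "schemaLocation"):
            key = XSI_NS + " " + local
            if key in attrs:
                result["kind"], result["ref"] = "schema", attrs[key]
                return

    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    parser.Parse(xml_bytes, True)
    return result["kind"], result["ref"]
```

A check like this would let the DTD-based code path stay untouched while
schema-based records are routed elsewhere, whichever of the two designs is
chosen.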

My original thought was to branch an update for the current XML parser in
Biopython, but it looks like supporting schemas would be a major overhaul
of the existing code base, and it might make more sense to develop a
parallel parser. So I first wanted to check which approach you would
prefer code-wise.

Regards,

Ivan