[Biopython-dev] [Bug 2488] Adding XML parsers to Bio.Entrez

Sat Apr 12 15:38:38 EDT 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2488

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2008-04-12 15:38 EST -------
Created an attachment (id=904)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=904&action=view)
Bio/Entrez/PubmedArticle.py

This is a possible Bio/Entrez/PubmedArticle.py which implements an XML parser
for the PubMed database.

When constructing a dictionary to hold each publication, I am deliberately
flattening and simplifying the very deeply nested structure the NCBI uses.  In
general, do we want to provide a faithful conversion of the full XML DOM
structure into Python objects, or just a simplification?  If the user cares
about the exact XML structure, or particular elements, they are probably better
off writing their own parsers using DOM or SAX as they see fit.
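
A minimal sketch of this kind of flattening, using the standard library's
ElementTree rather than the attached PubmedArticle.py.  The sample record is
hand-written and heavily trimmed, and the function name and dictionary keys
are assumptions chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample in the style of a PubMed record (illustrative only).
SAMPLE = """\
<PubmedArticle>
  <MedlineCitation>
    <PMID>12345678</PMID>
    <Article>
      <Journal><Title>Example Journal</Title></Journal>
      <ArticleTitle>An example title.</ArticleTitle>
      <AuthorList>
        <Author><LastName>Smith</LastName><Initials>J</Initials></Author>
        <Author><LastName>Jones</LastName><Initials>AB</Initials></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>
"""


def flatten_article(xml_text):
    """Return a flat dict holding a few fields from one PubmedArticle element.

    Hypothetical helper: the key names are not taken from any Biopython module.
    """
    root = ET.fromstring(xml_text)
    # Collapse each nested <Author> element into a single "LastName Initials" string.
    authors = [
        "%s %s" % (a.findtext("LastName", ""), a.findtext("Initials", ""))
        for a in root.iter("Author")
    ]
    return {
        "pmid": root.findtext("MedlineCitation/PMID"),
        "title": root.findtext("MedlineCitation/Article/ArticleTitle"),
        "journal": root.findtext("MedlineCitation/Article/Journal/Title"),
        "authors": authors,
    }


record = flatten_article(SAMPLE)
```

The trade-off is visible here: the caller gets record["title"] directly, but
anything not explicitly copied into the dictionary is lost.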

This still needs more testing, and it might be better to store the dates as
date objects rather than as dictionaries.  Also, I am currently ignoring the
"history" elements.
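
A rough sketch of the date conversion, assuming each date is held as a
dictionary with string values under "Year", "Month" and "Day" (these key
names, and the to_date() helper itself, are assumptions for illustration).
It also assumes the month may be either numeric or a three-letter English
abbreviation.

```python
import datetime

# Assumed month abbreviations; numeric months are handled separately below.
_MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def to_date(parts):
    """Convert a {"Year", "Month", "Day"} dict of strings to a datetime.date.

    Hypothetical helper: a missing month or day defaults to 1.
    """
    year = int(parts["Year"])
    month_text = parts.get("Month", "1")
    if month_text.isdigit():
        month = int(month_text)
    else:
        month = _MONTHS[month_text[:3].title()]
    day = int(parts.get("Day", "1"))
    return datetime.date(year, month, day)


created = to_date({"Year": "2008", "Month": "04", "Day": "12"})
```

One caveat with this approach: defaulting a missing month or day to 1 hides
the fact that the original record was less precise, which a dictionary
would have preserved.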

It may be worthwhile returning a Reference object (see the GenBank parser) for
these entries...

Just thinking out loud about the Bio.Entrez parsers in general:

Why don't the Bio/Entrez/XXX.py modules implement subclasses of
Bio.Entrez.DataHandler, rather than just the two methods startElement() and
endElement()?  I'm trying to understand why you did it this way round, Michiel.
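
To make the subclass alternative concrete, here is a sketch built on the
standard library's xml.sax.  BaseHandler is a hypothetical stand-in for a
shared base class and PubmedHandler for a database-specific subclass; neither
reflects the actual Bio.Entrez.DataHandler code.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class BaseHandler(ContentHandler):
    """Hypothetical shared base class: collects character data for subclasses."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.text = []
        self.record = {}

    def characters(self, content):
        self.text.append(content)


class PubmedHandler(BaseHandler):
    """Hypothetical subclass supplying only the database-specific callbacks."""

    def startElement(self, name, attrs):
        # Reset the text buffer at the start of every element.
        self.text = []

    def endElement(self, name):
        if name == "ArticleTitle":
            self.record["title"] = "".join(self.text).strip()


handler = PubmedHandler()
xml.sax.parseString(
    b"<Article><ArticleTitle>An example title.</ArticleTitle></Article>",
    handler,
)
```

With this layout each database module would define one class and inherit the
shared behaviour, instead of exporting two free-standing functions.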

Finally, in Bio/Entrez/__init__.py why is the _NameToModule dict defined within
the DataHandler class?  This seems to prevent it from being edited, which would
be desirable if the user wanted to add or change the parsers called by
Bio.Entrez.read() in their script.
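
As an illustration of what a user-editable mapping could look like, here is a
sketch of a module-level registry.  All of the names (_name_to_parser,
register_parser, dispatch) are hypothetical, and dispatch() is not the
signature of Bio.Entrez.read().

```python
# Hypothetical module-level registry: top-level XML tag -> parser callable.
_name_to_parser = {}


def register_parser(tag, parser):
    """Add or replace the parser used for documents whose root element is `tag`."""
    _name_to_parser[tag] = parser


def dispatch(tag, payload):
    """Look up the parser registered for `tag` and apply it to `payload`."""
    try:
        parser = _name_to_parser[tag]
    except KeyError:
        raise ValueError("no parser registered for %r" % tag)
    return parser(payload)


# A user script could register its own parser before reading any data.
register_parser("PubmedArticleSet", lambda payload: {"kind": "pubmed", "raw": payload})
result = dispatch("PubmedArticleSet", "<PubmedArticleSet/>")
```

Keeping the mapping at module level with a small registration function means
a script can swap in its own parser without touching the handler class.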

Peter
