[Biopython-dev] Bio.Entrez XML parsing

Tue Apr 1 14:23:29 UTC 2008

> Do you only intend to support Entrez XML files with this read()
> function, or potentially other formats too?

As all of Entrez's EUtils can return XML output (with many of them returning XML only), I was thinking of parsing XML files only. EUtils output in one of the sequence formats ought to be parsed by Bio.SeqIO. I am not sure if there are any other major file formats that we should handle. We can think about that later if and when the need arises.

> Even for the assorted XML formats, I'm not yet clear on how you
> imagine this being extended.

This I am not clear on either; I just added this in response to Sean's request so we have some concrete code to look at. Sean, could you give an example of how you would extend (this or a different) parser?

> Have you had a chance to look at Eric's Entrez Taxonomy XML
> parser?  It would need some re-factoring to fit in (see attachments
> on Bug 2475).
> http://bugzilla.open-bio.org/show_bug.cgi?id=2475

Eric uses a DOM parser, while I am using a SAX parser. DOM parsers have the advantage that they allow modification of the XML tree, whereas SAX just goes through the XML in one pass. SAX is preferable for large files, since DOM keeps the full XML file in memory, but maybe it is not so relevant for NCBI's EUtils. Anyway, if the end result is a Python object representing the XML, it doesn't matter much whether we go through DOM or SAX. Eric, do you have a strong preference for DOM?
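
To make the comparison concrete, here is a minimal sketch (not the actual Bio.Entrez code, and not Eric's parser) that extracts the same list from a small XML fragment once with SAX and once with DOM, using only the Python standard library. The SAMPLE document, the IdHandler class, and the two helper functions are made up for illustration; real EUtils output is larger and carries a DTD declaration.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample input, loosely modelled on an ID list; not real EUtils output.
SAMPLE = b"<IdList><Id>19304878</Id><Id>14630660</Id></IdList>"


class IdHandler(xml.sax.ContentHandler):
    """SAX handler that collects the text of every <Id> element in one pass."""

    def __init__(self):
        super().__init__()
        self.ids = []
        self._buffer = None  # holds text chunks while inside an <Id> element

    def startElement(self, name, attrs):
        if name == "Id":
            self._buffer = []

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "Id":
            self.ids.append("".join(self._buffer))
            self._buffer = None


def ids_via_sax(data):
    """Stream through the XML; only the collected IDs are kept in memory."""
    handler = IdHandler()
    xml.sax.parseString(data, handler)
    return handler.ids


def ids_via_dom(data):
    """Build the full tree in memory, then query it."""
    tree = minidom.parseString(data)
    return [node.firstChild.data for node in tree.getElementsByTagName("Id")]


print(ids_via_sax(SAMPLE))
print(ids_via_dom(SAMPLE))
```

Both functions return the same plain Python list, which is the point above: the caller does not need to know which parser produced it.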

Once we have the basic framework for the Bio.Entrez parser settled, we can merge it with Eric's code.

--Michiel
