[Biopython-dev] XML parsing library for new modules

Wed Apr 29 15:28:58 EDT 2009

Hi all,

I'm writing a parser for the PhyloXML format for Google Summer of Code this
year, and as the name would imply, it requires parsing some large XML files.
The existing modules in Biopython for parsing XML formats seem to use
xml.sax in the standard library. In Python 2.5, a faster and more Pythonic
parser was added to the standard lib: ElementTree (xml.etree), in
pure-Python and C-enhanced flavors. How do you feel about each of these
libraries as the basis for a new Biopython module?

Here are some interesting benchmarks:
http://effbot.org/zone/celementtree.htm#benchmarks

The ElementTree library is also available as a standalone package,
compatible back to Python 2.1, and the lxml package also offers an
independent implementation. So maintaining compatibility with Python 2.4
would require the availability of one of these third-party packages, and my
code would try each of these imports in order:

from xml.etree import cElementTree as ElementTree
from xml.etree import ElementTree
# Separate lxml package (bind the lxml.etree module, not its ElementTree class)
from lxml import etree as ElementTree
# Standalone elementtree package
import cElementTree as ElementTree
from elementtree import ElementTree

Then one day, when Python 2.4 is no longer supported, only the first two
lines would be needed. (The second line is for sites that disable C
extensions, like Google App Engine, or alternate Python implementations like
Jython.)
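
To make "try each of these imports in order" concrete, here is a rough sketch
of the fallback chain. It assumes ImportError is the only failure worth
catching, and for lxml it binds the lxml.etree module (rather than the
ElementTree class inside it) so that the name ElementTree refers to a module
with the same basic API in every branch. The XML string at the end is just a
toy document for a quick sanity check, not real PhyloXML data.

```python
# Try each ElementTree implementation in order of preference.
# Whichever import succeeds first is bound to the name ElementTree.
try:
    # Standard library, C-accelerated (Python 2.5+)
    from xml.etree import cElementTree as ElementTree
except ImportError:
    try:
        # Standard library, pure Python (Python 2.5+)
        from xml.etree import ElementTree
    except ImportError:
        try:
            # Separate lxml package; import the module, not the class
            from lxml import etree as ElementTree
        except ImportError:
            try:
                # Standalone package, C-accelerated
                import cElementTree as ElementTree
            except ImportError:
                # Standalone package, pure Python
                from elementtree import ElementTree

# Quick sanity check on a toy document (an assumption, not real data).
root = ElementTree.fromstring("<phyloxml><phylogeny rooted='true'/></phyloxml>")
print(root.tag)
```

The rest of the module would then only ever refer to ElementTree, so dropping
the third-party fallbacks later should just mean deleting the inner
try/except blocks.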

Another option is xml.parsers.expat, but from a quick search around, the
Python zeitgeist appears to be strongly in favor of xml.etree for new code.

Thoughts?

Thanks,
Eric