[Biopython-dev] latest checking

Andrew Dalke adalke at mindspring.com
Tue Jan 22 20:32:34 EST 2002


Okay, I've updated CVS to my latest working version of
Martel and the Bioformat code.

Most of this commit was related to performance.  I can
go into details, but it isn't worth it since it isn't
there any more.

I can currently convert sprot38.dat from SWISS-PROT to
a full SeqRecord in just under 20 minutes (extrapolated).
This is down from about 28 minutes, so roughly 30% faster.
If I really had to, I figured out a hack approach that
might shave another few minutes off the time.

The ncbi blast parser takes about 4.7 seconds to parse
my main test file while Jeff's code takes 3.7 seconds.
I managed to whittle it down from 15 seconds, so I'm
pretty happy.  The hack approach might get me another
half second, but I'm not doing quite everything Jeff
does so no predictions that I can match his performance.

The biggest performance problem is Python's function call
overhead.  For some thing like
   <name>Andrew</name>
there are normally at least three function calls
   startElement("name", {})
   characters("Andrew")
   endElement("name)

The ContentHandler can be written as a chain of if/else
statements,
  def startElement(self, tag, attrs):
    if tag == ..
      ...
    elif tag == "name":
      self.s = ""
      self.save_characters = 1
    ...
  def characters(self, s):
    if self.save_characters:
      self.s += s
  def endElement(self, tag):
    ...
    elif tag == "name":
      print "I have", self.s
      self.save_characters = 0

The problem with this is the comparison tests ("==")
are linear in the number of tags.  Even if sorted to
present most common tags first, there's still quite
a bit of overhead -- perhaps a few dozen equality tests
for a handful of lines of "real" code.

One thing I tried last year was a "dispatch" handler,
which looks like this

  def startElement(self, tag, attrs):
    f = getattr(self, "start_" + tag)
    if f is not None:
      f(tag, attrs)

  def endElement(self, tag):
    f = getattr(self, "end_" + tag)
    if f is not None:
      f(tag, attrs)
  ...
  def start_name(self, tag, attrs):
    self.s = ""
    self.save_characters = 1
  def end_name(self, tag):
    print "I have", self.s
    self.save_characters = 0

This replaces the equality tests with a getattr - which
is a dictionary lookup - and an extra function call. 
This turns out to be faster, at least in my test.

In the latest Martel distribution, I've one-upped this.
When the Dispatch.Dispatcher() starts up, it introspects
itself and finds all the method names which start with
"start_" and "end_", then builds a table mapping tag
name to function, so the dispatch doesn't need to create
an intermediate string

  def startElement(self, tag, attrs):
    f = self._start_table.get(tag)
    if f is not None:
      f(tag, attrs)

I then tweaked the Martel.Parser so if the ContentHandler
is a Dispatcher then specialized parser code reaches
into the class to get its _start_table and _end_table.
(A la a C++ "friend" class.)  This reduces function call
overhead in two ways:
  - there isn't the intermediate startElement/endElement
     call
  - if there's no handler for that tag then there are no
     function calls at all.

These tricks really make an appreciable performance
difference, don't change the normal API, and isn't all
that hard to understand, which is why I can justify
breaking some OO boundaries.

I think there's one other performance problem in the
code, which is that state information is passed around
through class attributes.  Attribute lookups go through
a dict lookup while regular local variables are constant
time lookup and quite a bit faster.  I can't think of
any way to get around this, except that new-style Python
objects support __slots__ which might be faster.

                    Andrew
                    dalke at dalkescientific.com





More information about the Biopython-dev mailing list