[Biopython-dev] Re: [BioPython] performance problem in ParserSupport.EventGenerator._get_set_flags

Brad Chapman chapmanb at uga.edu
Tue Jun 10 18:09:12 EDT 2003


[moving this to the dev list since it's got attachments 'n things]

Hi Andreas;
I want to preface this by saying that I know nearly next to nothing
about profiling. That doesn't mean I'm not interested in helping
speed things up, just that I don't have much experience in doing it.

> it seems that when I parse a large GenBank file, most of the time is
> spend in this function. Can anybody think of a more efficient way to do
> this? I don't fully understand the algorithm here. What values can be
> stored in this flags? Only 0 and 1? Is there a difference between value
> 0 and not having the key in the dict?

I don't know about the efficiency issues, but I can help at least
explain what I'm trying to do here. Basically, this code is meant to
sort of transition between the way Martel does things (with XML and
SAX events) and the way that Biopython does things (with Consumers
and Event generators). So, saying that, it's going to be inherently
not optimized since I am basically jamming two concepts together
with this EventGenerator glue. Also, I suck at optimization (I try
to focus more on code clarity), so there's that as well.

The idea is that the flags keep track of where you are at in the XML
and collect up that information before passing it on to the
appropriate event. Looking at my code, it looks like a couple of
things are especially bad, but the biggest culprit is probably the
characters function:

        def characters(self, content):
            """Extract the information.

            Using the flags that are set, put the character information in
            the appropriate place.
            """
            set_flags = self._get_set_flags()

            # deal with each flag in the set flags
            for flag in set_flags:
                # collect up the content for all of the characters
                self._cur_content += content

I have zero idea why the heck I am adding the content on to the
current content multiple times. In fact, this junk code probably
only works because there is only one item for set_flags every time.
Just staring at the code, it looks like we could make self._cur_content 
a list, and then simply do:

def characters(self, content):
    """Extract the information.

    Using the flags that are set, put the character information in
    the appropriate place.
    """
    self._cur_content.append(content)

Then we could change the appropriate part of endElement to:

# add all of the information collected inside this tag
self.info[name].append("".join(self._cur_content))
self._cur_content = []

This should save a bunch of calls to _get_set_flags and won't mess
anything up, I don't believe.

Do you have time to try this and see if it speeds it up and doesn't
break anything? I'm happy to commit it but don't have time at the
moment to properly test it to make sure it doesn't break anything. I
could probably look at it over the weekend.

Let me know if this made any sense or helped. I've attached a patch
which does what I suggested above (but is completely untested).
I'm very happy to clean my code up to make it faster and do appreciate 
you looking at it.
Brad
-------------- next part --------------
*** ParserSupport.py.orig	Tue Jun 10 18:03:54 2003
--- ParserSupport.py	Tue Jun 10 18:06:59 2003
***************
*** 216,222 ****
              self._previous_tag = ''
  
              # the current character information for a tag
!             self._cur_content = ''
  
          def _get_set_flags(self):
              """Return a listing of all of the flags which are set as positive.
--- 216,222 ----
              self._previous_tag = ''
  
              # the current character information for a tag
!             self._cur_content = []
  
          def _get_set_flags(self):
              """Return a listing of all of the flags which are set as positive.
***************
*** 248,259 ****
              Using the flags that are set, put the character information in
              the appropriate place.
              """
!             set_flags = self._get_set_flags()
! 
!             # deal with each flag in the set flags
!             for flag in set_flags:
!                 # collect up the content for all of the characters
!                 self._cur_content += content
  
          def endElement(self, name):
              """Send the information to the consumer.
--- 248,254 ----
              Using the flags that are set, put the character information in
              the appropriate place.
              """
!             self._cur_content.append(content)
  
          def endElement(self, name):
              """Send the information to the consumer.
***************
*** 269,276 ****
              # interested in and potentially have information for
              if name in self._get_set_flags():
                  # add all of the information collected inside this tag
!                 self.info[name].append(self._cur_content)
!                 self._cur_content = ''
                  
                  # if we are at a new tag, pass on the info from the last tag
                  if self._previous_tag and self._previous_tag != name:
--- 264,271 ----
              # interested in and potentially have information for
              if name in self._get_set_flags():
                  # add all of the information collected inside this tag
!                 self.info[name].append("".join(self._cur_content))
!                 self._cur_content = []
                  
                  # if we are at a new tag, pass on the info from the last tag
                  if self._previous_tag and self._previous_tag != name:


More information about the Biopython-dev mailing list