[BioPython] performance problem in ParserSupport.EventGenerator._get_set_flags

Brad Chapman chapmanb at uga.edu
Fri Jun 13 12:56:46 EDT 2003


Hi Andreas;

> I think, first we should find out about the failing test. 

So I think after all your mails we are sorted on this. test_GenBank
works without the patch, but the patch breaks it. 

The reason is that the changes I made weren't complete.
I looked at this properly during lunch and just checked in some
changes which eliminate _get_set_flags entirely (and self.flags).
All tests appear to pass after this change, so it's checked into
CVS and the diff is attached. BTW, there are a couple of extraneous
changes in that diff: they just remove some tabs which snuck in
(bad tabs, bad tabs).

Let me know if this works for you (and still provides the
performance enhancements).
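
If you want to try the idea outside of ParserSupport.py, here is a
minimal standalone sketch of the single-flag approach. It runs against
the standard library's xml.sax rather than Martel, and the class name
InterestCollector, the collect() helper and the tag names are made up
for illustration. Only interest_tags, info, _cur_content and
_collect_characters mirror the attached diff. It assumes, as the patch
seems to, that tags of interest do not contain nested elements.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class InterestCollector(ContentHandler):
    """Hypothetical stand-in for the patched handler (not the real class)."""

    def __init__(self, interest_tags):
        ContentHandler.__init__(self)
        self.interest_tags = interest_tags
        # one list of collected strings per tag of interest
        self.info = {tag: [] for tag in interest_tags}
        # pieces of character data for the tag currently being read
        self._cur_content = []
        # a single flag replaces the old per-tag flags dictionary
        self._collect_characters = False

    def startElement(self, name, attrs):
        if name in self.interest_tags:
            self._collect_characters = True

    def characters(self, content):
        if self._collect_characters:
            self._cur_content.append(content)

    def endElement(self, name):
        # Assumption: interest tags are leaves, so the element closing
        # while we are collecting is the one that switched collecting on.
        if self._collect_characters:
            self.info[name].append("".join(self._cur_content))
            self._cur_content = []
            self._collect_characters = False


def collect(xml_text, interest_tags):
    """Parse xml_text and return the collected info dictionary."""
    handler = InterestCollector(interest_tags)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.info
```

Calling collect() on a small record with tags like locus and size
should give back one list of strings per tag, with everything else
ignored. Joining the list once in endElement avoids rebuilding a
string on every characters() call.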

> But there is still potential in it :-)
> At the moment, (after my optimization) about 90% of my performance test
> goes into Parser._do_callback. Of this, 60% is spent in endElement, 5%
> in startElement and 7% in characters. The remaining time is spent in
> _do_callback itself and for the recursion. So to get faster we could:
> 1. make _do_callback itself faster (Don't see how)
> 2. make endElement faster.
> 3. reduce recursion somehow(?) function calls are expensive in Python.
> 4. Invent some clever algorithm

I did some clean-up in endElement (mainly getting rid of the
_get_set_flags call, which looped over every flag each time) so this
might provide some speed-ups for that problem. I don't have any
genius ideas for the other points right now, but maybe these simple
clean-up changes will improve performance decently.
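
To check whether the numbers actually move, something along these
lines could be used to see how the time splits between functions
before and after the change. This is only a sketch using cProfile and
pstats from the standard library. profile_callable() and busy_parse()
are hypothetical names, and busy_parse() is a placeholder for a real
parse of your test file.

```python
import cProfile
import io
import pstats


def profile_callable(func, *args, **kwargs):
    """Run func under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # sort by cumulative time so the expensive call chains come first
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


def busy_parse(n):
    """Hypothetical stand-in for the parsing work being measured."""
    chunks = []
    for i in range(n):
        chunks.append(str(i))
    return "".join(chunks)


if __name__ == "__main__":
    _, report = profile_callable(busy_parse, 100000)
    print(report)
```

Running the same call on the old and new revisions and comparing the
cumulative times for endElement, startElement and characters would
show whether the clean-up helps.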

Hopefully this code is a little cleaner (damn, it was ugly before).
If you want to send me your diffs on top of this I'm happy to
commit 'em.

Thanks again for working on this.
Brad
-------------- next part --------------
Index: ParserSupport.py
===================================================================
RCS file: /home/repository/biopython/biopython/Bio/ParserSupport.py,v
retrieving revision 1.21
retrieving revision 1.22
diff -c -r1.21 -r1.22
*** ParserSupport.py	7 Dec 2001 18:48:26 -0000	1.21
--- ParserSupport.py	13 Jun 2003 15:49:29 -0000	1.22
***************
*** 103,112 ****
  
      def _print_name(self, name, data=None):
          if data is None:
! 	    # Write the name of a section.
              self._handle.write("%s %s\n" % ("*"*self._colwidth, name))
          else:
! 	    # Write the tag and line.
              self._handle.write("%-*s: %s\n" % (
                  self._colwidth, name[:self._colwidth],
                  string.rstrip(data[:self._maxwidth-self._colwidth-2])))
--- 103,112 ----
  
      def _print_name(self, name, data=None):
          if data is None:
!             # Write the name of a section.
              self._handle.write("%s %s\n" % ("*"*self._colwidth, name))
          else:
!             # Write the tag and line.
              self._handle.write("%-*s: %s\n" % (
                  self._colwidth, name[:self._colwidth],
                  string.rstrip(data[:self._maxwidth-self._colwidth-2])))
***************
*** 195,206 ****
              self._finalizer = callback_finalizer
              self._exempt_tags = exempt_tags
  
-             # a dictionary of flags to recognize when we are in different
-             # info items
-             self.flags = {}
-             for tag in self.interest_tags:
-                 self.flags[tag] = 0
- 
              # a dictionary of content for each tag of interest
              # the information for each tag is held as a list of the lines.
              # This allows us to collect information from multiple tags
--- 195,200 ----
***************
*** 216,259 ****
              self._previous_tag = ''
  
              # the current character information for a tag
!             self._cur_content = ''
! 
!         def _get_set_flags(self):
!             """Return a listing of all of the flags which are set as positive.
!             """
!             set_flags = []
!             for tag in self.flags.keys():
!                 if self.flags[tag] == 1:
!                     set_flags.append(tag)
! 
!             return set_flags
  
          def startElement(self, name, attrs):
!             """Recognize when we are recieving different items from Martel.
! 
!             We want to recognize when Martel is passing us different items
!             of interest, so that we can collect the information we want from
!             the characters passed.
              """
!             # set the appropriate flag if we are keeping track of these flags
!             if self.flags.has_key(name):
!                 # make sure that all of the flags are being properly unset
!                 assert self.flags[name] == 0, "Flag %s not unset" % name
! 
!                 self.flags[name] = 1
  
          def characters(self, content):
!             """Extract the information.
! 
!             Using the flags that are set, put the character information in
!             the appropriate place.
              """
!             set_flags = self._get_set_flags()
! 
!             # deal with each flag in the set flags
!             for flag in set_flags:
!                 # collect up the content for all of the characters
!                 self._cur_content += content
  
          def endElement(self, name):
              """Send the information to the consumer.
--- 210,230 ----
              self._previous_tag = ''
  
              # the current character information for a tag
!             self._cur_content = []
!             # whether we should be collecting information
!             self._collect_characters = 0
  
          def startElement(self, name, attrs):
!             """Determine if we should collect characters from this tag.
              """
!             if name in self.interest_tags:
!                 self._collect_characters = 1
  
          def characters(self, content):
!             """Extract the information if we are interested in it.
              """
!             if self._collect_characters:
!                 self._cur_content.append(content)
  
          def endElement(self, name):
              """Send the information to the consumer.
***************
*** 267,276 ****
              """
              # only deal with the tag if it is something we are
              # interested in and potentially have information for
!             if name in self._get_set_flags():
                  # add all of the information collected inside this tag
!                 self.info[name].append(self._cur_content)
!                 self._cur_content = ''
                  
                  # if we are at a new tag, pass on the info from the last tag
                  if self._previous_tag and self._previous_tag != name:
--- 238,249 ----
              """
              # only deal with the tag if it is something we are
              # interested in and potentially have information for
!             if self._collect_characters:
                  # add all of the information collected inside this tag
!                 self.info[name].append("".join(self._cur_content))
!                 # reset our information and flags
!                 self._cur_content = []
!                 self._collect_characters = 0
                  
                  # if we are at a new tag, pass on the info from the last tag
                  if self._previous_tag and self._previous_tag != name:
***************
*** 278,287 ****
  
                  # set this tag as the next to be passed
                  self._previous_tag = name
- 
-                 # unset the flag for this tag so we stop collecting info
-                 # with it
-                 self.flags[name] = 0
  
          def _make_callback(self, name):
              """Call the callback function with the info with the given name.
--- 251,256 ----

