[Biopython-dev] Martel performance

Mon Dec 11 17:03:38 EST 2000

I'm finding my PIR and GenBank parsers (the latter rather modified
from Brad's because I was trying to be stricter about whitespace)
to be pretty slow.  The PIR parser only parses 43K of text per
second, while the GenBank one manages just 6.6K/second.  Compare
that to the SwissProt parser, where I parsed the whole file
in 20 minutes, which is about 200K per second.

These tests were done on different machines, but there's only
about a factor of 2 performance difference between them.
(Comparison done by running my genbank regression test on
my Intel laptop and on the bioperl.org Alpha machine, which
is where the PIR and GenBank tests are run.  My laptop is
faster although sshd on bioperl takes 50% of the CPU.)
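
For scale, the throughput figures above can be turned into ratios.
This is a small sketch in modern Python.  The factor-of-2 machine
correction assumes the SwissProt run was on the faster machine, which
the text implies but does not state.

```python
# Throughputs quoted above, in K of text parsed per second.
throughput = {"SwissProt": 200.0, "PIR": 43.0, "GenBank": 6.6}

# Assumption: the SwissProt run was on the faster machine, so up to
# a factor of 2 of each gap is hardware rather than the parser.
MACHINE_FACTOR = 2.0

def slowdown(name, baseline="SwissProt"):
    """How many times slower 'name' is than the baseline parser."""
    return throughput[baseline] / throughput[name]

if __name__ == "__main__":
    for name in ("PIR", "GenBank"):
        raw = slowdown(name)
        print("%-8s %5.1fx slower, %5.1fx after machine correction"
              % (name, raw, raw / MACHINE_FACTOR))
```

Even with the hardware correction applied, GenBank is left with a
gap of well over a factor of 10 that needs explaining.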

I can only think of a few reasons which might cause this:

  1) Martel is intrinsically slow - but see sprot as a counterexample

  2) These two files use indented whitespace for continuations and
to indicate subitems. Almost every time you get to the end of a
line it needs to test if the next line is a continuation.  In most
cases it isn't, so about 1/4 of the file is read twice.  But that's
not a factor of 20.

  3) Brad has a list of possible feature key names and a list of
qualifiers.   Odds are you have to scan 1/2 the list before
finding a matching name.  This again causes some duplicate
checks, but only in the features section and I just can't see
another factor of two out of that.

  4) The regexp to allow folding with the whitespace indentation
is something like:
  indicator + \
   Group("tag", text) + \
   Rep(space_indent + Group("tag", text))

This can make for some very large regular expressions.  The GenBank
expression, written out as a string, is about 6K long.  The size of
the generated tag table is harder to estimate, but it's roughly 100K,
while PIR's is about 600K.  These are state transition tables, so
perhaps I'm losing cache locality because most of my jumps are too
large.  I don't know what effect sshd has on the overall
bioperl.org performance.  It only has 72K of RSS, so I can't
see how there's a bad context switch hit.

  I can't find any equivalent on Linux to IRIX's 'osview'
or 'gr_osview', which is what I usually used to look at this
sort of overhead.  Any pointers?

  5) I'm using the same RecordReader for SWISS-PROT and
GenBank (EndsWith) so that shouldn't be a problem.  However,
in the first I think I was using the reader directly while
with GenBank I'm going through the HeaderFooter parser.
There might be some difference there, but I can't think of
what that might be.

  6) Memory use

I'm using gbpri8 as my test case.  The first entry, HUAF001549,
is about 260K long with 202K bases.  This causes my format
definition to take up 50MB (!) of memory according to top,
so roughly a 200-fold expansion.  My test with SWISS-PROT and
MDL's .mol files only needed a factor of about 6 as I recall.  I don't
know why so much memory is needed for GenBank and I didn't
look at PIR's use to compare.
  As an aside, Edwin Steele points out that LMFLCHR12 has
2Mbases so is about an order of magnitude larger.  Well, RAM
is cheap.

Without Martel running, bioperl.org's 'free' says:
             total       used       free     shared    buffers     cached
Mem:        126568     121048       5520      58568       4544      77624
-/+ buffers/cache:      38880      87688
Swap:       208760      23760     185000

When I run the test, it says:
             total       used       free     shared    buffers     cached
Mem:        126568     123056       3512      57656       3688      29760
-/+ buffers/cache:      89608      36960
Swap:       208760      23704     185056

Compare that to top's
 PID USER  PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM  TIME COMMAND
7930 dalke  17   0 53240  51M  1824 R     51M 49.8 21.0  5:43 python

As I read it, all of the memory was in use before the test, but
77MB of that was cache.  When the python job started, about 47MB
of cache was dropped and given to python, so it's all running in
main memory.  Swap use changed by only about 56K, so there isn't
a lot of page swapping going on.
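
The arithmetic behind this reading can be checked directly from the
two 'free' snapshots.  A minimal sketch in modern Python; the
dictionary keys are just labels chosen here for the columns of the
output quoted above.

```python
# The two 'free' snapshots quoted above (all values in K).
before = {"used": 121048, "buffers": 4544, "cached": 77624, "swap_used": 23760}
during = {"used": 123056, "buffers": 3688, "cached": 29760, "swap_used": 23704}

def app_memory(snap):
    """Memory held by processes: 'used' minus buffers and page cache.

    This is the 'used' figure on the '-/+ buffers/cache' line.
    """
    return snap["used"] - snap["buffers"] - snap["cached"]

def delta(key):
    """Change in one column between the two snapshots."""
    return during[key] - before[key]

growth = app_memory(during) - app_memory(before)  # extra memory held by processes
cache_drop = -delta("cached")                     # memory given up by the page cache
swap_change = delta("swap_used")                  # change in swap use

if __name__ == "__main__":
    print("process memory grew by %dK" % growth)
    print("page cache shrank by   %dK" % cache_drop)
    print("swap use changed by    %dK" % swap_change)
```

Process memory grows by about 50MB, in line with what top reports
for the python job, and most of that comes out of the page cache
rather than swap.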

I've ordered a new disk for my laptop and more memory.  That
will give me a chance to test everything on a dedicated machine.
Hopefully the problem is simply context switch overhead with
the sshd2 and http sessions on bioperl.org.

I've put off doing real work for too long so I won't have time
to look at this for a couple of weeks.   If anyone wants to work
out what the problem is using the latest code, it's on
biopython.org in /tmp/dalke/gb/Martel .  It's now in the tedious
part of timing and profiling.  (One approach might be to take
a section of a file, duplicate it a lot of times, and measure
how the times and memory use change as a function of size.)
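
The duplicate-and-measure approach could be sketched as follows, in
modern Python.  The parse() function here is a hypothetical stand-in
and not Martel; the real parser call would go in its place.  The
sample lines are made up for illustration.  Note that tracemalloc only
sees memory allocated through Python's own allocator, so memory taken
by a C extension would have to be read from the process size instead.

```python
import re
import time
import tracemalloc

# Hypothetical stand-in for the parser under test; swap in the real
# Martel-based parse call here.  It just splits each line in two.
_line = re.compile(r"(\S+)\s+(.*)")

def parse(text):
    return [m.groups() for m in map(_line.match, text.splitlines()) if m]

def measure(section, copies):
    """Parse 'section' repeated 'copies' times.

    Returns (input size in bytes, seconds taken, peak traced bytes).
    """
    text = section * copies          # built before tracing starts
    tracemalloc.start()
    start = time.perf_counter()
    parse(text)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return len(text), elapsed, peak

if __name__ == "__main__":
    # Made-up sample lines, only for illustration.
    section = "LOCUS       EXAMPLE\nFEATURES    Location/Qualifiers\n"
    for copies in (100, 1000, 10000):
        size, secs, peak = measure(section, copies)
        print("%9d bytes  %8.4f s  peak %9d bytes  (%.1fx input)"
              % (size, secs, peak, peak / size))
```

If time or peak memory grows faster than linearly with input size,
that points at the format definition rather than at machine load.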

Hmm.  There is another difference between the GenBank format
and the others.  I'm using the \R construct for newline detection.
Perhaps there's some unexpected performance hit there, though I
can't see what that would be.
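
One way to isolate the newline question is to time a plain "\n" match
against a pattern that accepts every newline convention.  The sketch
below uses Python's 're' module as a stand-in, so the numbers say
nothing directly about Martel's engine.  It assumes \R accepts "\n",
"\r" and "\r\n"; since 're' has no \R, the alternatives are spelled
out.  The sample text is made up.

```python
import re
import timeit

# Assumption: \R accepts any of the three newline conventions.
# This is a stand-in comparison, not Martel's own engine.
PLAIN = re.compile(r"[^\r\n]*\n")
ANY_NEWLINE = re.compile(r"[^\r\n]*(?:\r\n|\r|\n)")

def count_lines(pattern, text):
    """Count lines by repeatedly matching 'pattern' from the current position."""
    pos = count = 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if m is None:
            break                    # no newline of the expected kind
        pos = m.end()
        count += 1
    return count

if __name__ == "__main__":
    text = "ORIGIN      some sequence text\n" * 20000
    for name, pat in (("\\n only", PLAIN), ("any newline", ANY_NEWLINE)):
        secs = timeit.timeit(lambda: count_lines(pat, text), number=5)
        print("%-12s %.3f s" % (name, secs))
```

The same comparison run against two otherwise identical Martel
format definitions would show whether \R itself costs anything.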

                    Andrew
                    dalke at acm.org