[BioPython] Martel-0.2 announcement

Wed, 23 Aug 2000 00:12:51 -0600

Hello,

I've released Martel-0.2 at http://www.biopython.org/~dalke/Martel/

I also made my poster from BOSC available, at
  http://www.biopython.org/~dalke/Martel/BOSC2000.poster/
I would say it went over pretty well, or rather, I told a lot of people
about it and it looked like a few understood :)

The biggest changes to this version are:
  - support for named group references (like '(?P<spam>...)(?P=spam)',
    which matches "ABCABC" and "byebye" but not "123321" or "bye-bye").
    This hasn't been tested thoroughly against the re parser.

  - an experimental term, "named repeats", like
 r"(?P<num_atoms>\d+)\n((?P<x>.....) (?P<y>.....) (?P<z>.....)\n){num_atoms}"
    In this case, the repeat count for the atom lines is based on the
    integer value of the previous num_atoms match.  This was needed for
    the MDL CT file parser, and was suggested by Roger Sayle.

Because of implementation difficulties, any and all parsers using the
above two features must be single threaded.  (There is no thread limitation
that I know of if these features aren't used.)
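
As a rough illustration of the two features above, here is a sketch that
uses Python's standard re module rather than Martel itself.  The re module
supports named group references directly but has no named repeats, so the
second half emulates one in two steps.  The names matches_doubled and
parse_counted_block, and the sample data, are made up for this example and
are not part of Martel.

```python
import re

# Named group reference: (?P=spam) must repeat the text (?P<spam>...) matched.
DOUBLED = re.compile(r"(?P<spam>...)(?P=spam)")

def matches_doubled(text):
    """True if text is a three-character chunk followed by the same chunk."""
    return DOUBLED.fullmatch(text) is not None

# Emulated named repeat (assumption: plain re has no {num_atoms} syntax,
# so read the count first, then match exactly that many coordinate lines).
HEADER = re.compile(r"(?P<num_atoms>\d+)\n")
ATOM_LINE = re.compile(r"(?P<x>.....) (?P<y>.....) (?P<z>.....)\n")

def parse_counted_block(text):
    """Return a list of (x, y, z) strings, or None if the block is malformed."""
    header = HEADER.match(text)
    if header is None:
        return None
    pos = header.end()
    atoms = []
    for _ in range(int(header.group("num_atoms"))):
        line = ATOM_LINE.match(text, pos)
        if line is None:
            return None  # fewer atom lines than the count promised
        atoms.append((line.group("x"), line.group("y"), line.group("z")))
        pos = line.end()
    return atoms

if __name__ == "__main__":
    for text in ["ABCABC", "byebye", "123321", "bye-bye"]:
        print(text, matches_doubled(text))
    block = ("2\n"
             + "%5s %5s %5s\n" % ("1.0", "2.0", "3.0")
             + "%5s %5s %5s\n" % ("4.0", "5.0", "6.0"))
    print(parse_counted_block(block))
```

The two-step version shows why the repeat count has to be carried from one
match to the next: the number of atom lines is only known once the header
has been read.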

  - Some example programs:
   o  A basic program for converting SWISS-PROT records into HTML using
      the SAX interface directly.  (I want to rewrite it in DOM to see what
      that's like - or you could give it a go.)

   o  A data file to XML converter.  If you don't specify the file format,
      it will try to guess it automatically.
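
To give a feel for what "using the SAX interface directly" involves, here
is a minimal sketch of a SAX handler that turns element events into HTML.
It uses Python's standard xml.sax on a small XML string as a stand-in for
the event stream; it is not the example program shipped with Martel, and
the tag names (record, entry_name, description) and the helper xml_to_html
are assumptions made up for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import escape

class RecordToHTML(ContentHandler):
    """Collect HTML fragments as SAX events arrive (hypothetical tag names)."""

    # Map the assumed field tags to the HTML tags used for output.
    FIELD_TAGS = {"entry_name": "h2", "description": "p"}

    def __init__(self):
        super().__init__()
        self.parts = []     # finished HTML fragments
        self._field = None  # name of the field currently being read
        self._text = []     # character chunks for that field

    def startElement(self, name, attrs):
        if name == "record":
            self.parts.append("<div class='record'>")
        elif name in self.FIELD_TAGS:
            self._field = name
            self._text = []

    def characters(self, content):
        # SAX may deliver one field's text in several chunks.
        if self._field is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == self._field:
            tag = self.FIELD_TAGS[name]
            text = escape("".join(self._text))
            self.parts.append("<%s>%s</%s>" % (tag, text, tag))
            self._field = None
        elif name == "record":
            self.parts.append("</div>")

def xml_to_html(xml_bytes):
    """Parse XML bytes and return the HTML built by the handler."""
    handler = RecordToHTML()
    xml.sax.parseString(xml_bytes, handler)
    return "\n".join(handler.parts)

if __name__ == "__main__":
    sample = (b"<record><entry_name>EXAMPLE_ID</entry_name>"
              b"<description>A &amp; B</description></record>")
    print(xml_to_html(sample))
```

The handler has to track which field it is inside by hand, which is the
bookkeeping that makes SAX code confusing as formats grow.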

I also did some timing tests, which are mentioned in the README at
  http://www.biopython.org/~dalke/Martel/README.txt
Basically, on my laptop (a 233 MHz machine, I think) I can parse the 250MB
of swissprot38 in 20 minutes when the callbacks are made but do no
processing.  Without any callbacks, it takes 10.5 minutes.  Because of
memory constraints, I ran these tests using Jeff's trick: a very simple
scanner reads in one record at a time, and Martel parses only that record
instead of all the records at once.  I like those times :)
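
The record-at-a-time trick can be sketched as a small generator.  This is
only an illustration, not Jeff's scanner: it relies on SWISS-PROT records
ending with a line containing just "//", and the names iter_records and
parse_record are hypothetical, with parse_record standing in for whatever
per-record parser is actually used.

```python
import io

def iter_records(handle, terminator="//"):
    """Yield one record at a time so only that record is held in memory."""
    lines = []
    for line in handle:
        lines.append(line)
        if line.rstrip("\r\n") == terminator:
            yield "".join(lines)
            lines = []
    if lines:
        # Trailing data without a terminator line is passed through as-is.
        yield "".join(lines)

def parse_record(record):
    """Hypothetical stand-in for the real parser: return the first line."""
    return record.split("\n", 1)[0]

if __name__ == "__main__":
    data = "ID   FIRST\nDE   something\n//\nID   SECOND\n//\n"
    for record in iter_records(io.StringIO(data)):
        print(parse_record(record))
```

Memory use then depends on the largest single record rather than on the
size of the whole file, at the cost of clients having to know about the
scanner.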

I'm not going to be able to do much work on this for at least a month
or so, but I hope people start playing around with it.

Jeff and I talked about it at BOSC and decided we should evaluate it
as a candidate for biopython.  That means:
  o  is it useful?
      - does it do everything we want?  (most importantly, can we build our
        internal data structures using it?)
      - SAX can get confusing, but too confusing?  What about DOM?
  o  is the performance acceptable and/or what can be done to improve it?
  o  is the mentioned workaround for memory usage acceptable?  What can be
      done to make that workaround transparent for clients?
  o  is the requirement for the mxTextTools C extension okay?
      - mxTextTools comes with a pure Python implementation as well, but
        I haven't tested it.
  o  some more bullet points I'm probably missing.

                    Andrew
                    dalke@acm.org