[Biopython-dev] Is Martel always appropriate?

Andrew Dalke adalke at mindspring.com
Fri Nov 30 01:04:12 EST 2001


Cayte:
>It looks to me that Martel is not a good fit for SAF.  That is because the
>appropriate action depends on state information.  If the line starts with
>"zebra" append the sequence to the zebra sequence, if it starts with
>"giraffe" append to the giraffe sequence, if it starts with "unicorn"
>append to the unicorn sequence.
>
>A simpler approach would be to filter the comments out, then split each
>line and use the label component as a selector.  Please share your
>opinions.

I've had a hard time trying to figure out an answer to this.  Perhaps
the best way to answer is to start with the philosophy behind Martel.

Many formats need to be read in bioinformatics.  Nearly everyone
writes parsers from scratch.  That wastes time and gets boring.  What
I want is a tool to help identify parts of the text and provide
a standard framework for building a parser.

Strictly speaking, Martel is a tokenizer, not a parser.  That means
it is used to identify parts of the text, but all it does with that
information is pass the subtext and some description off to the real
parser.  The parser takes these two things and does whatever is
appropriate, which is usually building up some sort of data structure.

The boundary between the two is vague and learned by experience.
For example, at one extreme, here's a format definition for SAF and
every other format.

import Martel

format = Martel.Rep(Martel.Group("character", Martel.Re(r"[\000-\377]")))

In this case, the tokenizer isn't doing anything to help understand
the format -- all the work is passed off to the parser, and all
Martel does is provide a SAX-based parser framework.
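
To make that concrete, here's roughly how a format definition gets
hooked up to a handler.  This is a sketch from memory -- I believe the
calls are make_parser(), setContentHandler() and parseFile(), and
XMLGenerator is the standard SAX handler which just echoes the events
back out as XML.  The file name is made up.

from xml.sax import saxutils

parser = format.make_parser()
parser.setContentHandler(saxutils.XMLGenerator())  # echo every event as XML
parser.parseFile(open("example.saf"))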

Martel can help provide better tokenization, even when it doesn't
know everything about the format.  For example, it seems there are three
types of lines in the SAF format:

from Martel import Alt, Group, Re, Rep, Str, ToEol

# name_1   EFQEDQENVN PEKAAPAQQP RTRAGLAVLR AGNSRGAGGA PTLPETLNVA
line1_format = Group("line",
                     Group("name", Re(r"[\w]{1,14}")) + \
                     Re(r"[ \t]+") + \
                     ToEol("sequence"))

#                     10         20         30         40
line2_format = Group("numbers",
                     Re(r" (\d+ *)*\R"))

comment_line_format = Group("comment", Str("#") + ToEol())

# any given line is one of the three kinds, so this is an alternation,
# not a concatenation
line_format = Alt(line1_format, line2_format, comment_line_format)

format = Rep(line_format)

All this does is help figure out which lines contain sequences
and which should be ignored, and of those with sequences, it
says which characters belong to the "name" and which belong
to the "sequence".
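
For the first example line above, the callbacks would go something
like this (roughly -- exactly how the whitespace is split up, and
whether the newline lands inside the "sequence" group, depends on the
ToEol definition):

startElement("line")
  startElement("name")
    characters("name_1")
  endElement("name")
  characters("   ")
  startElement("sequence")
    characters("EFQEDQENVN PEKAAPAQQP RTRAGLAVLR AGNSRGAGGA PTLPETLNVA")
  endElement("sequence")
endElement("line")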

The parser (the SAX handler) is the one in charge of turning this
information into something useful, and is the one which does the
state transitions.  For example, here's one which might work

from xml.sax import handler

class SAFHandler(handler.ContentHandler):
  def startDocument(self):
    self.capture = 0
    self.text = None
    self.sequences = {}
    self.guide_name = None
    self.current_name = None
    self.block = {}

  def startElement(self, tag, attrs):
    if tag == "name" or tag == "sequence":
      self.capture = 1
      self.text = ""

  def characters(self, text):
    if self.capture:
      self.text = self.text + text

  def _new_block(self):
    for name, seq in self.block.items():
      self.sequences[name] = self.sequences.get(name, "") + seq
    self.block.clear()

  def endElement(self, tag):
    if tag == "name":
      self.current_name = self.text
      if self.text == self.guide_name:  # start a new block
        self._new_block()
      elif self.guide_name is None:     # keep track of the guide name
        self.guide_name = self.text
    elif tag == "sequence":
      if not self.block.has_key(self.current_name):  # no duplicates in a block
        self.block[self.current_name] = self.text.replace(" ", "")
    self.capture = 0

  def endDocument(self):
    self._new_block()

After parsing, the handler's 'sequences' should contain all the
sequence entries, and its 'guide_name' tells which of those should be
listed first.  (The format definition at
  http://www.embl-heidelberg.de/predictprotein/Dexa/optin_safDes.html
implies the order of the other items is arbitrary.)
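
Hooking that handler up to the format defined above would go
something like this (again untested, and the file name is made up):

parser = format.make_parser()
saf = SAFHandler()
parser.setContentHandler(saf)
parser.parseFile(open("example.saf"))

print saf.guide_name
print saf.sequences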


So in this case, Martel isn't powerful enough to detect when new
blocks arise, but it is able to provide some context to simplify
matters for the parser.  On the other hand, here's how the parser
would be written in a more standard style

import re

_line_pat = re.compile(r"([^ ]{1,14})[ \t]+(.*)")

def _new_block(block, sequences):
  for name, seq in block.items():
    sequences[name] = sequences.get(name, "") + seq
  block.clear()

def parse(infile):
  block = {}
  sequences = {}
  guide_name = None
  for line in infile.readlines():  # assume enough memory
    if line.startswith("#"):
      continue
    if line.startswith(" "):  # an approximation for the 'numbers' lines
      continue
    # Split at the first run of spaces or tabs
    m = _line_pat.match(line)
    if m is None:
      raise "bad format"
    name = m.group(1)
    seq = m.group(2)
    if name == guide_name:
      _new_block(block, sequences)
    elif guide_name is None:
      guide_name = name
    if not block.has_key(name):  # no duplicates within a block
      block[name] = seq.replace(" ", "")
  _new_block(block, sequences)
  return sequences

This is about 30 lines, compared to about 55.  It's shorter and
easier to understand.  As you say, it's simpler.

One reason it's simpler is that the format itself is very simple --
there's almost no structure to it.  Another reason is that the Martel
handler may need to accumulate text across multiple 'characters'
callbacks.  It's also simpler because the parser only needs to do one
thing -- build up a data structure.  This SAF format doesn't need
things like an XML markup generator or an indexer or support for
multiple variations of a format.  Martel loses partly because it is
too flexible.

Addressing your statement:
>Martel is not oriented to state machines
>except as implied by the ordering of expressions.

You are correct.  But Martel is only intended to be half the
solution.  The other half is the callback handler.  Together
they handle SAF just fine, except with about twice the complexity
if all you want to do is turn SAF into a data structure.

It's possible to conceive of a different way to receive callbacks
which is specialized for this case.  Consider something like this:

from Martel import SimpleHandler

# "SimpleHandler" is a hypothetical base class which stores all
# the events inside a set of named elements and returns them as
# a simple dictionary data structure where the key is the
# tag name and the values are all of the matching text.  Hierarchical
# data structures are ignored.  This is similar to Sean McGrath's
# RAX approach.

class MyHandler(SimpleHandler.SimpleHandler):
  def startDocument(self):
    self.sequences = {}
    self.block = {}
    self.guide_name = None

  def parseLine(self, terms):
    name = terms["name"]
    if name == self.guide_name:
      self._new_block()
    elif self.guide_name is None:
      self.guide_name = name
    if not self.block.has_key(name):
      self.block[name] = terms["sequence"].replace(" ", "")

  def endDocument(self):
    self._new_block()

  def _new_block(self):
    for name, seq in self.block.items():
      self.sequences[name] = self.sequences.get(name, "") + seq
    self.block.clear()

handler = MyHandler(want = ("line",))
parser = format.make_parser()    # the SAF format defined earlier
parser.setContentHandler(handler)
parser.parseFile(open("example.saf"))

and this is of comparable length to the hand-rolled code.

If this task of parsing simple data structures is common enough, then
perhaps the best solution is to write a specialized handler base class
which abstracts out the generic dirty work.
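
For what it's worth, here's a rough idea of what such a base class
could look like.  It's only a sketch built on the standard SAX
ContentHandler -- the 'want' and 'parseLine' names just follow the
hypothetical API above, and none of it exists in Martel today.

from xml.sax import handler

class SimpleHandler(handler.ContentHandler):
  # Collect the text of the named sub-elements inside each "wanted"
  # element.  When a wanted element (eg. "line") ends, call parseLine()
  # with a dictionary mapping tag name -> captured text.  Nesting is
  # ignored, as in RAX.
  def __init__(self, want):
    handler.ContentHandler.__init__(self)
    self.want = want
    self._terms = None
    self._tag = None
    self._text = ""

  def startElement(self, tag, attrs):
    if tag in self.want:
      self._terms = {}       # start collecting for this element
    elif self._terms is not None:
      self._tag = tag        # a sub-element whose text we want
      self._text = ""

  def characters(self, text):
    if self._tag is not None:
      self._text = self._text + text

  def endElement(self, tag):
    if tag in self.want:
      self.parseLine(self._terms)
      self._terms = None
    elif tag == self._tag:
      self._terms[tag] = self._text
      self._tag = None

  def parseLine(self, terms):
    pass  # subclasses override this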

                    Andrew
                    dalke at dalkescientific.com
P.S.
  And no, I didn't test any of this code :)




