[BioPython] parsing with Martel

Andrew Dalke <dalke@dalkescientific.com>
Mon, 13 May 2002 00:49:36 -0600


Jay Hesselberth:
>but if I've got something like:
>
><tag> DATA </tag>;
>
>where the semicolon at the end is unwanted, the semicolon ends up in a
>TEXT node in the parsed xml.  I'm a bit confused about this, as I was
>(naively) under the impression that things like xml.sax.ContentHandler
>don't care about untagged stuff.

Martel passes everything to the handler because it doesn't know which
information should be ignored.  E.g., you may want the character position
of each tag for use in an indexer.  The DOM turns all the characters()
events into a TEXT node, which is what you see.
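
The effect is easy to reproduce with the standard library alone.  This is
only a minimal sketch and does not use Martel: it assumes a made-up wrapper
element named 'record' so that the input is well-formed XML, and it builds
the DOM with xml.dom.minidom.

```python
import xml.dom.minidom

# Hypothetical input: the 'record' wrapper is an assumption, added so the
# example is well-formed XML; Martel is not involved here.
doc = xml.dom.minidom.parseString(b"<record><tag> DATA </tag>;</record>")

# Collect the text nodes that are direct children of the root element.
stray = [node.data for node in doc.documentElement.childNodes
         if node.nodeType == node.TEXT_NODE]
print(stray)
```

The stray semicolon shows up in 'stray' as a text node next to the tag
element, which is the same thing the DOM is doing in your case.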

>I guess what I would like to do is be able to post-filter the output,
>removing everything that remains untagged after converting the file to
>xml.  Is there a built-in mechanism for this?

Built-in?  No, but you can write your own filter for this very easily.
Here's an untested example.

import xml.sax

class IgnoreUntaggedText(xml.sax.ContentHandler):
    def __init__(self, handler, ignore_depth = 0):
        self.handler = handler
        # Text nested no deeper than this many tags is dropped.
        self.ignore_depth = ignore_depth
        self.depth = 0
    def startDocument(self):
        self.depth = 0
        self.handler.startDocument()
    def endDocument(self):
        self.handler.endDocument()
    def startElement(self, tag, attrs):
        self.depth += 1
        self.handler.startElement(tag, attrs)
    def endElement(self, tag):
        self.handler.endElement(tag)
        self.depth -= 1
    def characters(self, s):
        # Forward text only when nested deeper than ignore_depth.
        if self.depth > self.ignore_depth:
            self.handler.characters(s)

If you have a handler 'h' then you can use the above with

  h = IgnoreUntaggedText(h)


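What this does is ignore all characters events which aren't nested
inside more than "ignore_depth" tags.  Everything else is forwarded to
the original handler.  With the default of 0 only text outside every tag
is dropped; pass a larger ignore_depth to also drop text that sits
directly inside the outer elements.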
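
To try the idea without Martel, here is a self-contained sketch driven by
the standard xml.sax parser.  The names DepthFilter and Recorder, and the
'record' wrapper element in the input, are made up for illustration; the
sketch also skips forwarding startDocument/endDocument to keep it short.

```python
import xml.sax


class Recorder(xml.sax.ContentHandler):
    """Hypothetical downstream handler: collects the text it receives."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, s):
        self.chunks.append(s)


class DepthFilter(xml.sax.ContentHandler):
    """Forward characters() only when nested deeper than ignore_depth."""
    def __init__(self, handler, ignore_depth=0):
        xml.sax.ContentHandler.__init__(self)
        self.handler = handler
        self.ignore_depth = ignore_depth
        self.depth = 0

    def startElement(self, tag, attrs):
        self.depth += 1
        self.handler.startElement(tag, attrs)

    def endElement(self, tag):
        self.handler.endElement(tag)
        self.depth -= 1

    def characters(self, s):
        if self.depth > self.ignore_depth:
            self.handler.characters(s)


recorder = Recorder()
# ignore_depth=1 drops text that sits directly inside the root element.
xml.sax.parseString(b"<record><tag> DATA </tag>;</record>",
                    DepthFilter(recorder, ignore_depth=1))
kept = "".join(recorder.chunks)
print(repr(kept))
```

Here the semicolon is at depth 1 (inside 'record' only) and is dropped,
while " DATA " is at depth 2 and reaches the recorder.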

                    Andrew