[BioPython] RE: Biopython XML parsers

Mike Cariaso MCariaso at Endogenybio.com
Fri Jun 20 12:29:56 EDT 2003


Yesterday I asked a question about generating a new parser for an XML
format (MAGE-ML). I've had some success and believe I can now answer my
own question, but I'd still like to run this past the biopythonistas
and anyone who's got more Python experience than me.

Q: How do I use a SAX parser with an iterator?

A: Because the XML document is nested, my initial attempts failed:
each chunk I handed to the parser contained the opening tags of the
enclosing elements but not their closing tags.

I.e., the full XML doc looks like this:

<xml>
<group>
  <BioSequence>
  </BioSequence>

  <BioSequence>
  </BioSequence>
</group>

<group>
  <BioSequence>
  </BioSequence>

  <BioSequence>
  </BioSequence>
</group>
</xml>

I wanted my iterator to iterate over each BioSequence block. But it was
choking because at the first iteration all it could see was:

<xml>
<group>
  <BioSequence>
  </BioSequence>

Which is not valid, since there are no closing </group> or </xml> tags.
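
To see the failure in isolation, here is a minimal sketch (Python 3,
standard library only) that hands the truncated chunk above to a SAX
parser. The helper name is_well_formed is made up for illustration and
is not part of Biopython. The parser only complains once it reaches the
end of the input with elements still open.

```python
import xml.sax


def is_well_formed(data):
    """Return True if a SAX parser accepts `data` (bytes) as a complete document."""
    try:
        # A bare ContentHandler ignores every event; we only care about errors.
        xml.sax.parseString(data, xml.sax.ContentHandler())
    except xml.sax.SAXParseException:
        return False
    return True


# What the iterator saw on its first pass: <xml> and <group> are never closed.
truncated = b"<xml>\n<group>\n  <BioSequence>\n  </BioSequence>\n"

# The same record stripped down to a single self-contained element.
stripped = b"<BioSequence>\n</BioSequence>\n"

print(is_well_formed(truncated))
print(is_well_formed(stripped))
```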

So the iterator needs to strip the records down to just the
<BioSequence>...stuff...</BioSequence> blocks.
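
A rough sketch of that stripping step, written as a Python 3 generator
rather than the Python 2 class in the working code further down. The
name iter_biosequence_blocks is hypothetical. It assumes the input is
read line by line and that a record starts on a line containing
"<BioSequence " or "<BioSequence>" and ends on a line containing
"</BioSequence>"; XML laid out differently would need a real parser.

```python
import io


def iter_biosequence_blocks(handle):
    """Yield each <BioSequence>...</BioSequence> block from a text handle."""
    lines = None  # None means we are between records
    for line in handle:
        # Matching on the trailing ' ' or '>' avoids picking up other tags
        # whose names merely begin with "BioSequence".
        if lines is None and ("<BioSequence " in line or "<BioSequence>" in line):
            lines = []
        if lines is not None:
            lines.append(line)
            if "</BioSequence>" in line:
                yield "".join(lines)
                lines = None


sample = io.StringIO(
    "<xml>\n"
    "<group>\n"
    "  <BioSequence>\n"
    "  </BioSequence>\n"
    "\n"
    "  <BioSequence>\n"
    "  </BioSequence>\n"
    "</group>\n"
    "</xml>\n"
)

for block in iter_biosequence_blocks(sample):
    print(repr(block))
```

Each yielded string is a complete element on its own, so it can be
passed to a parser without the enclosing <group> or <xml> tags.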

Most of the bits of the example that relate to a Scanner can (I
believe) be replaced by the standard xml.sax modules.
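
As an illustration of the xml.sax side on its own, here is a small
sketch in Python 3 using only the standard library. BioSequenceHandler
and parse_block are illustrative names, and the identifier and name
values are made up; only the attribute names 'identifier' and 'name'
come from the working code below.

```python
import xml.sax


class BioSequenceHandler(xml.sax.ContentHandler):
    """Collect (identifier, name) pairs from BioSequence start tags."""

    def __init__(self):
        super().__init__()
        self.records = []

    def startElement(self, name, attrs):
        if name == "BioSequence":
            # attrs.get() returns None when the attribute is absent.
            self.records.append((attrs.get("identifier"), attrs.get("name")))


def parse_block(block):
    """Parse one self-contained BioSequence block and return its records."""
    handler = BioSequenceHandler()
    xml.sax.parseString(block.encode("utf-8"), handler)
    return handler.records


# Hypothetical record; real MAGE-ML identifiers will look different.
example = '<BioSequence identifier="seq:1" name="demo">\n</BioSequence>\n'
print(parse_block(example))
```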

The working code is below. I imagine the powers that be are pretty busy
actually creating biopython, and holding down real jobs. So I welcome
feedback, and especially advice. But I post this more so that it will go
into the archives, and possibly be useful to others.

#!/usr/bin/env python

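import string
# FileType and InstanceType are used to sanity-check the handle
from types import FileType, InstanceType
from xml.sax import make_parser

from Bio import File
from Bio.ParserSupport import AbstractConsumer, safe_peekline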

class MageIterator:

    def __init__(self, handle, parser=None):
        if type(handle) is not FileType and type(handle) is not InstanceType:
            raise ValueError, "I expected a file handle or file-like object"
        self._uhandle = File.UndoHandle(handle)
        self._parser = parser

    def next(self):
        # Skip any leading lines until the next <BioSequence ...> start tag
        while 1:
            try:
                line = safe_peekline(self._uhandle)
            except SyntaxError:
                break
            if not line: break
            if -1 != line.find('<BioSequence '): break
            self._uhandle.readline()

        lines = []
        while 1:                #Capture the biosequence record
            line = self._uhandle.readline()
            if not line: break
            lines.append(line)
            if lines and -1 != line.find('</BioSequence>'):break

        if not lines:
            return None
                                #call the parser to fire the handlers
        data = string.join(lines, '')
        if self._parser is not None:
            self._parser.parse(File.StringHandle(data))
        return data

class _MageConsumer(AbstractConsumer):
    def _unhandled(self, *foo):
        pass

    def startElement(self, name, attrs):
        if name == 'BioSequence':
            print 'ID\t',attrs['identifier']
            if 'name' in attrs.keys() : print 'Name\t',attrs['name']

if __name__ == '__main__':
    fh = open('MG_U74Av2_annot_small.xml')
    parser = make_parser()
    parser.setContentHandler(_MageConsumer())

    iterator = MageIterator(fh, parser)

    while 1:
        rec = iterator.next()
        if rec is None: break
    print 'done'





