[BioPython] RE: Biopython XML parsers
Mike Cariaso
MCariaso at Endogenybio.com
Fri Jun 20 12:29:56 EDT 2003
Yesterday I asked a question about generating a new parser for an XML
format (MAGE-ML). I've had some success and believe I can now answer my
own question, but I'd still like to run this past the biopythonistas,
and anyone who's got more Python experience than me.
Q: How do I use a SAX parser with an iterator?
A: Because the XML document is nested, my initial attempts failed: each
record I handed to the parser contained the opening tags of the enclosing
elements, but not their closing tags.
I.e., the full XML doc looks like this:
<xml>
  <group>
    <BioSequence>
    </BioSequence>
    <BioSequence>
    </BioSequence>
  </group>
  <group>
    <BioSequence>
    </BioSequence>
    <BioSequence>
    </BioSequence>
  </group>
</xml>
I wanted my iterator to iterate over each BioSequence block. But it was
choking because at the first iteration all it could see was:
<xml>
  <group>
    <BioSequence>
    </BioSequence>
Which is not well-formed, since there are no closing </group> or </xml> tags.
So the iterator needs to strip the records down to just the
<BioSequence>...stuff...</BioSequence> blocks.
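The stripping step can be sketched on its own with nothing but the standard library. This is only an illustration in present-day Python 3, not the code from this post: the function name iter_blocks and the sample data are made up, and it assumes the opening and closing BioSequence tags never share a line with the wrapper tags.

```python
import io

def iter_blocks(handle, start="<BioSequence", end="</BioSequence>"):
    """Yield each start...end block from a line-oriented handle as one string."""
    lines = []
    for line in handle:
        if not lines and start not in line:
            # Still outside a record: skip wrapper tags such as <xml> or <group>.
            continue
        lines.append(line)
        if end in line:
            # The record is complete; hand it over and start collecting afresh.
            yield "".join(lines)
            lines = []

# Hypothetical sample data, shaped like the nested document described above.
sample = io.StringIO(
    "<xml>\n"
    "<group>\n"
    '<BioSequence identifier="a">\n'
    "</BioSequence>\n"
    '<BioSequence identifier="b">\n'
    "</BioSequence>\n"
    "</group>\n"
    "</xml>\n"
)
blocks = list(iter_blocks(sample))
print(len(blocks))
```

Each string in blocks is a well-formed fragment by itself, so it can be handed to a parser without the enclosing tags getting in the way.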
Most of the Scanner-related parts of the example can (I believe) be
replaced by the standard xml.sax modules.
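To give a rough idea of what the xml.sax side looks like on its own, here is a minimal sketch in present-day Python 3 (the code in this post is Python 2). The class name BioSequenceHandler and the sample attributes are made up for illustration; only the xml.sax calls are standard.

```python
import xml.sax

class BioSequenceHandler(xml.sax.ContentHandler):
    """Collect (identifier, name) pairs from BioSequence start tags."""

    def __init__(self):
        super().__init__()
        self.records = []

    def startElement(self, name, attrs):
        # xml.sax calls this once for every opening tag it encounters.
        if name == "BioSequence":
            # attrs.get returns None when the attribute is absent.
            self.records.append((attrs.get("identifier"), attrs.get("name")))

handler = BioSequenceHandler()
# A single stripped-down record (hypothetical content).
block = b'<BioSequence identifier="seq1" name="demo"></BioSequence>'
xml.sax.parseString(block, handler)
print(handler.records)
```

The handler plays the role of the consumer: the parser does the scanning and fires startElement, so no hand-written Scanner is needed for this part.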
The working code is below. I imagine the powers that be are pretty busy
actually creating biopython, and holding down real jobs. So I welcome
feedback, and especially advice. But I post this more so that it will go
into the archives, and possibly be useful to others.
#!/usr/bin/env python

from Bio.ParserSupport import *
from xml.sax import make_parser

class MageIterator:
    def __init__(self, handle, parser=None):
        if type(handle) is not FileType and type(handle) is not InstanceType:
            raise ValueError, "I expected a file handle or file-like object"
        self._uhandle = File.UndoHandle(handle)
        self._parser = parser

    def next(self):
        while 1:  # Strip any leading
            try:
                line = safe_peekline(self._uhandle)
            except SyntaxError:
                break
            if not line:
                break
            if -1 != line.find('<BioSequence '):
                break
            self._uhandle.readline()

        lines = []
        while 1:  # Capture the biosequence record
            line = self._uhandle.readline()
            if not line:
                break
            lines.append(line)
            if lines and -1 != line.find('</BioSequence>'):
                break

        if not lines:
            return None

        # call the parser to fire the handlers
        data = string.join(lines, '')
        if self._parser is not None:
            self._parser.parse(File.StringHandle(data))
        return data

class _MageConsumer(AbstractConsumer):
    def _unhandled(self, *foo):
        pass

    def startElement(self, name, attrs):
        if name == 'BioSequence':
            print 'ID\t', attrs['identifier']
            if 'name' in attrs.keys():
                print 'Name\t', attrs['name']

if __name__ == '__main__':
    fh = open('MG_U74Av2_annot_small.xml')
    parser = make_parser()
    parser.setContentHandler(_MageConsumer())
    iterator = MageIterator(fh, parser)
    while 1:
        rec = iterator.next()
        if rec is None:
            break
    print 'done'
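For comparison, the standard library can also iterate over nested elements without stripping the wrapper tags first. The sketch below uses xml.etree.ElementTree.iterparse, which is a different technique from the SAX approach in this post, and it is written for present-day Python 3. The function name iter_biosequences and the sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def iter_biosequences(handle):
    """Yield the attributes of each BioSequence element as a plain dict."""
    # "end" events fire when an element's closing tag has been read.
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "BioSequence":
            yield dict(elem.attrib)  # copy before clearing the element
            elem.clear()  # drop the element's contents once it has been reported

# Hypothetical sample document with the same nesting as the example above.
doc = io.BytesIO(
    b"<xml><group>"
    b'<BioSequence identifier="a" name="first"></BioSequence>'
    b'<BioSequence identifier="b"></BioSequence>'
    b"</group></xml>"
)
for attrs in iter_biosequences(doc):
    print("ID\t" + attrs["identifier"])
    if "name" in attrs:
        print("Name\t" + attrs["name"])
```

Because iterparse reads the document incrementally and tracks the enclosing elements itself, the unclosed <group> and <xml> tags are not a problem here.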