[Biopython-dev] Medline XML parsers
Jeffrey Chang
jeffrey_chang at stanfordalumni.org
Wed May 5 10:32:14 EDT 2004
Hi Marc,
> Two questions: first, it seems that none of the current xml classes
> handle the latest release. Is this correct?
That's right. There's no parser for the latest release. I haven't
looked at the latest format yet, but usually the changes are pretty
minor. It should not be too hard to update the
nlmmedline_011101_format to handle the latest files.
> And second,
> how would you use those classes to parse an xml document? From my
> understanding of martel, I would still need to make an xml parser
> which then makes this seem odd.
Yep, Martel is a SAX parser! You'd parse the Martel format the same
way as you'd parse an XML file. You have to create an
xml.sax.handler.ContentHandler object to receive each of the events you
care about.
Warning: untested code! :)
from xml.sax import handler
from Bio.Medline import NLMMedlineXML

class MyHandler(handler.ContentHandler):
    def __init__(self):
        ...
    def startElement(self, name, attrs):
        ...
    def characters(self, content):
        ...
    def endElement(self, name):
        ...

# filename is the path to your MEDLINE XML file
my_content = MyHandler()
# Look at the start of the file to pick the matching format version
format = NLMMedlineXML.choose_format(open(filename).read(1000))
parser = format.citation_format.make_parser()
parser.setContentHandler(my_content)
parser.setErrorHandler(handler.ErrorHandler())
parser.feed(open(filename))
To get the whole record, the startElement, characters, and endElement
functions in your content handler have to store all the different
elements that appear in the MEDLINE record. Because there are many
elements, doing so is a lot of work! It would probably be useful, but
I worry about the speed. Martel has to generate function calls for
each element, and function calls are slow in Python. If you want to do
that for all of MEDLINE, then the little bits of time for function
calls add up to some real time. When I've used Martel to parse MEDLINE
in the past, I've created specialized content handlers to pull out only
the elements I was interested in. You can tell Martel to ignore the
rest of the elements (using the "select_names" function), which speeds
things up considerably.
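
To make the "specialized content handler" idea concrete, here is a
minimal sketch of a handler that keeps only a couple of elements and
ignores everything else. It uses the standard library's xml.sax parser
rather than Martel, so it runs on its own; it does not show
select_names. The sample record is made up, and SelectiveHandler and
its "wanted" argument are names invented for this illustration.

```python
import xml.sax
from xml.sax import handler

# Made-up sample record; the content is purely illustrative.
SAMPLE = b"""<MedlineCitationSet>
<MedlineCitation>
<PMID>12345</PMID>
<Article>
<ArticleTitle>An example title</ArticleTitle>
<Abstract><AbstractText>Ignored text.</AbstractText></Abstract>
</Article>
</MedlineCitation>
</MedlineCitationSet>"""


class SelectiveHandler(handler.ContentHandler):
    """Collect text only for the element names listed in `wanted`."""

    def __init__(self, wanted):
        handler.ContentHandler.__init__(self)
        self.wanted = set(wanted)
        self.records = []      # one dict per MedlineCitation
        self._current = None   # name of the wanted element being read
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "MedlineCitation":
            self.records.append({})
        elif name in self.wanted:
            self._current = name
            self._buffer = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self._current is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self._current:
            self.records[-1][name] = "".join(self._buffer)
            self._current = None


collector = SelectiveHandler(["PMID", "ArticleTitle"])
xml.sax.parseString(SAMPLE, collector)
print(collector.records)
```

The handler still receives an event for every element here; the saving
with Martel comes from not generating those events in the first place.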
Jeff