[Biopython-dev] Medline XML parsers

Wed May 5 10:32:14 EDT 2004

Hi Marc,

> Two questions: first, it seems that none of the current xml classes 
> handle the latest release. Is this correct?

That's right.  There's no parser for the latest release.  I haven't 
looked at the latest format yet, but usually the changes are pretty 
minor.  It should not be too hard to update the 
nlmmedline_011101_format to handle the latest files.

>  And second,
> how would you use those classes to parse an xml document? From my 
> understanding of martel, I would still need to make an xml parser 
> which then makes this seem odd.

Yep, Martel is a SAX parser!  You'd parse the Martel format the same 
way as you'd parse an XML file.  You have to create an 
xml.sax.handler.ContentHandler object to receive each of the events you 
care about.

Warning: untested code!  :)

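from xml.sax import handler
# Import path assumed; adjust it if NLMMedlineXML lives elsewhere
# in your Biopython install.
from Bio.Medline import NLMMedlineXML

class MyHandler(handler.ContentHandler):
    def __init__(self):
        ...
    def startElement(self, name, attrs):
        ...
    def characters(self, content):
        ...
    def endElement(self, name):
        ...

# "filename" is the path to your MEDLINE XML file.
my_content = MyHandler()
# Pick the format from the first 1000 characters of the file.
format = NLMMedlineXML.choose_format(open(filename).read(1000))
parser = format.citation_format.make_parser()
parser.setContentHandler(my_content)
parser.setErrorHandler(handler.ErrorHandler())
parser.feed(open(filename))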
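
To see the order in which the handler methods get called, here is a 
small sketch that you can run without Martel.  It uses the standard 
library's xml.sax parser instead of a Martel parser, and the sample 
document is a made-up fragment (the EventLogger class and the SAMPLE 
data are only for illustration), but the ContentHandler interface is 
the same one described above.

```python
import xml.sax
from xml.sax import handler


class EventLogger(handler.ContentHandler):
    """Record each SAX event as a tuple so the call order is visible."""

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # Ignore whitespace-only text between elements.
        text = content.strip()
        if text:
            self.events.append(("chars", text))

    def endElement(self, name):
        self.events.append(("end", name))


# Hypothetical, heavily shortened fragment; real MEDLINE records
# contain many more elements.
SAMPLE = b"<MedlineCitation><PMID>12345</PMID></MedlineCitation>"

logger = EventLogger()
xml.sax.parseString(SAMPLE, logger)
for event in logger.events:
    print(event)
```

Each element produces a start event, zero or more characters events, 
and an end event, so a handler has to keep track of which element it 
is currently inside.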

To get the whole record, the startElement, characters, and endElement 
functions in your content handler have to store all the different 
elements that appear in the MEDLINE record.  Because there are many 
elements, doing so is a lot of work!  It would probably be useful, but 
I worry about the speed.  Martel has to generate function calls for 
each element, and function calls are slow in Python.  If you want to do 
that for all of MEDLINE, then the little bits of time for function 
calls add up to some real time.  When I've used Martel to parse MEDLINE 
in the past, I've created specialized content handlers to pull out only 
the elements I was interested in.  You can tell Martel to ignore the 
rest of the elements (using the "select_names" function), which speeds 
things up considerably.
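
As a rough sketch of what such a specialized handler might look like, 
the example below keeps only a couple of fields per citation and 
ignores everything else.  It is an illustration under assumptions, not 
the handler I actually used: the FieldExtractor class and the sample 
data are made up, and it runs on the standard library's xml.sax parser 
rather than on Martel.  Note that it filters inside the handler, so 
the parser still makes a call for every element; it does not 
reproduce the speedup you get from having Martel skip the unwanted 
elements before any events are generated.

```python
import xml.sax
from xml.sax import handler


class FieldExtractor(handler.ContentHandler):
    """Collect the text of selected elements, one dict per citation."""

    def __init__(self, wanted):
        handler.ContentHandler.__init__(self)
        self.wanted = set(wanted)
        self.records = []
        self._record = None   # dict for the citation being read
        self._field = None    # name of the wanted element we are inside
        self._chunks = []     # text pieces for that element

    def startElement(self, name, attrs):
        if name == "MedlineCitation":
            self._record = {}
        elif name in self.wanted and self._record is not None:
            self._field = name
            self._chunks = []

    def characters(self, content):
        # Text can arrive in several pieces, so accumulate it.
        if self._field is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == self._field:
            self._record[name] = "".join(self._chunks)
            self._field = None
        elif name == "MedlineCitation":
            self.records.append(self._record)
            self._record = None


# Hypothetical, heavily shortened data for illustration only.
SAMPLE = b"""<MedlineCitationSet>
<MedlineCitation><PMID>111</PMID><Article>
<ArticleTitle>First title</ArticleTitle>
<Abstract>skipped</Abstract></Article></MedlineCitation>
<MedlineCitation><PMID>222</PMID><Article>
<ArticleTitle>Second title</ArticleTitle></Article></MedlineCitation>
</MedlineCitationSet>"""

extractor = FieldExtractor(["PMID", "ArticleTitle"])
xml.sax.parseString(SAMPLE, extractor)
for record in extractor.records:
    print(record)
```

The handler only does real work for the two fields it was asked for, 
which keeps the per-record bookkeeping small.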

Jeff