[Biopython-dev] mixins
Andrew Dalke
adalke at mindspring.com
Wed Jan 2 08:27:40 EST 2002
Been working on mixins all night. The idea is that
only parts of a file are important -- you may just want
the sequence, or the cross references, or the whatever.
If those fields are consistently tagged (been working on
that as well) then standard parsers can be used for the
different segments.
Some of the experimental mixins I have are
  dbid -- gives the primary/secondary/accessions
  description -- gives the main description text
  dbxref -- cross references to other databases
  features -- sequence features
  sequence -- sequence data
A problem with the standard SAX method is that it
uses a centralized set of methods, like 'startElement'.
Mixins can't each define their own startElement since
only one is called. So I made a DispatchHandler
which converts calls like
    startElement('spam', {})
into
    start_spam('spam', {})
In that way, the different handlers can listen only
for their associated event. And for the 'characters'
method, I have a stack-based way to start and stop saving
characters.
When a mixin is done, it calls a specific function back
in the handler, whose name so far starts with 'add_'.
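
A minimal sketch of the dispatching idea in modern Python, not the actual code. The base class comes from the standard xml.sax module; the helper names save_characters and get_characters, the TitleHandler example, and the choice that only the innermost active save receives text are all assumptions made for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class DispatchHandler(ContentHandler):
    """Route SAX events to per-tag methods such as start_spam / end_spam."""

    def __init__(self):
        super().__init__()
        self._save_stack = []  # one list of text chunks per active save

    def startElement(self, name, attrs):
        # Call start_<name> only if some class in the hierarchy defines it.
        method = getattr(self, "start_" + name, None)
        if method is not None:
            method(name, attrs)

    def endElement(self, name):
        method = getattr(self, "end_" + name, None)
        if method is not None:
            method(name)

    def characters(self, content):
        # Assumption: text goes only to the innermost active save.
        if self._save_stack:
            self._save_stack[-1].append(content)

    def save_characters(self):
        """Start saving text (hypothetical helper name)."""
        self._save_stack.append([])

    def get_characters(self):
        """Stop the innermost save and return its text (hypothetical name)."""
        return "".join(self._save_stack.pop())


class TitleHandler(DispatchHandler):
    """Hypothetical handler that listens only for <title> elements."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def start_title(self, tag, attrs):
        self.save_characters()

    def end_title(self, tag):
        self.titles.append(self.get_characters())


handler = TitleHandler()
xml.sax.parseString(
    b"<doc><title>spam</title><other>x</other><title>eggs</title></doc>",
    handler,
)
print(handler.titles)
```

Elements without a matching start_/end_ method, like <other> here, are simply ignored, which is what lets each mixin listen only for its own events.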
But wait, there's more! Jeff pointed out namespace
support, which XML supports with a syntax like 'ns:spam'.
It's kinda cumbersome using a ':' as a method name, so I've
translated that to "ns__spam" when I do the dispatching.
This lets people define a new builder with something like
    class FastaBuilder(dbid, description, sequence,
                       SaveText, DispatchHandler):
        def __init__(self):
            # ... call __init__ on the bases
            ...
        def start_record(self, tag, attrs):
            self.id = None
            self.description = None
            self.seq = None
        def add_dbid(self, dbid):
            ...
        def add_sequence(self, seq):
            ...
        def end_record(self, tag):
            self.document = FastaRecord(self.id, self.description, self.seq)
Now, writing that list of mixins is cumbersome, so I used
new.classobj so you can define
    FastaBuilderBase = MixinBuilder(dbid, description, sequence)
    class FastaBuilder(FastaBuilderBase):
        ...
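
A rough sketch of how such a MixinBuilder could work. The `new` module is gone from Python 3, so this uses the three-argument form of the built-in type() instead. The class bodies are empty placeholders, and the assumption that MixinBuilder appends SaveText and DispatchHandler to the list of bases is mine, not taken from the real code.

```python
class SaveText:
    """Placeholder for the text-saving base class."""


class DispatchHandler:
    """Placeholder for the dispatching base class."""


class dbid:
    """Placeholder mixin."""


class description:
    """Placeholder mixin."""


class sequence:
    """Placeholder mixin."""


def MixinBuilder(*mixins):
    """Create a base class combining the mixins with the common bases.

    Assumption: the common bases (SaveText, DispatchHandler) always go last.
    """
    bases = tuple(mixins) + (SaveText, DispatchHandler)
    name = "Builder_" + "_".join(m.__name__ for m in mixins)
    # type(name, bases, namespace) creates a new class at run time.
    return type(name, bases, {})


FastaBuilderBase = MixinBuilder(dbid, description, sequence)


class FastaBuilder(FastaBuilderBase):
    pass


print(FastaBuilderBase.__name__)
print([base.__name__ for base in FastaBuilderBase.__bases__])
```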
Another problem with mixins is that they share the same __dict__.
That can lead to hard-to-track-down mixups. So I've written
a way for a mixin to acquire methods from another handler,
but not share the same __dict__. It looks like this:
    class Handle_sequence(Callback):
        def start_bioformat__sequence(self, tag, attrs):
            self.alphabet = attrs.get("alphabet", "any")
        def end_bioformat__sequence(self, tag):
            seq = ...  # Sequence based on the alphabet and the characters
            self.callback(seq)
    # Here's the mixin
    class sequence:
        def __init__(self):
            acquire(self, Handle_sequence(self, self.add_sequence))
        def add_sequence(self, seq):
            pass
The 'acquire' function pulls off all methods starting with
'start_' and 'end_' and sticks them in the mixin's namespace.
So it looks like the sequence mixin implements things but it's really
Handle_sequence. And there's no possibility of 'self.alphabet'
being overridden by anyone else.
(It's actually slightly more complicated than this because
the acquisition can put on its own prefix, which helps with
code reuse.)
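
A simplified sketch of what 'acquire' might look like, leaving out the optional prefix. The Callback constructor signature is guessed from the call above, and the callback here just receives the alphabet as a stand-in for a real sequence object; both are assumptions for illustration.

```python
class Callback:
    """Assumed base class: remembers the owning handler and a callback."""

    def __init__(self, handler, callback):
        self.handler = handler
        self.callback = callback


def acquire(target, source):
    """Copy the bound start_*/end_* methods of source onto target.

    The methods stay bound to source, so any state they set lives in
    source's __dict__, not in target's.
    """
    for name in dir(source):
        if name.startswith(("start_", "end_")):
            setattr(target, name, getattr(source, name))


class Handle_sequence(Callback):
    def start_bioformat__sequence(self, tag, attrs):
        self.alphabet = attrs.get("alphabet", "any")

    def end_bioformat__sequence(self, tag):
        # Stand-in: pass the alphabet instead of building a real sequence.
        self.callback(self.alphabet)


class sequence:
    def __init__(self):
        self.alphabet = "set by the mixin"  # must not be clobbered
        self.received = []
        acquire(self, Handle_sequence(self, self.add_sequence))

    def add_sequence(self, seq):
        self.received.append(seq)


s = sequence()
s.start_bioformat__sequence("bioformat:sequence", {"alphabet": "protein"})
s.end_bioformat__sequence("bioformat:sequence")
print(s.received)
print(s.alphabet)
```

The mixin's own 'alphabet' attribute is untouched after the calls, because the acquired methods write to the Handle_sequence instance rather than to the mixin.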
Finally! Since Python is fully introspective, the DispatchHandler
can peer through the class hierarchy to figure out all of the
methods which are defined, and map them back to their proper
SAX tags. This list of tags can then be used to build a new
expression tree with all the other, unused tags filtered out.
What that means is, if you want to get more fields, stick in a
new mixin, and everything works automatically to get those
fields, with only the expected slowdown associated with the
extra work to identify and parse those fields.
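
A small sketch of the introspection step only: collecting the tags a handler class cares about from its method names, including the '__' to ':' namespace translation. The function name handled_tags and the example classes are made up for illustration, and the filtering of the expression tree itself is not modelled here.

```python
def handled_tags(handler_class):
    """Return the set of tags for which the class defines start_/end_ methods."""
    tags = set()
    # dir() on a class walks the whole inheritance hierarchy.
    for name in dir(handler_class):
        for prefix in ("start_", "end_"):
            if name.startswith(prefix) and callable(getattr(handler_class, name)):
                # Undo the namespace translation: 'ns__spam' -> 'ns:spam'.
                tags.add(name[len(prefix):].replace("__", ":"))
    return tags


class dbid:
    def start_dbid(self, tag, attrs):
        pass

    def end_dbid(self, tag):
        pass


class sequence:
    def start_bioformat__sequence(self, tag, attrs):
        pass

    def end_bioformat__sequence(self, tag):
        pass


class FastaBuilder(dbid, sequence):
    def start_record(self, tag, attrs):
        pass

    def end_record(self, tag):
        pass


print(sorted(handled_tags(FastaBuilder)))
```

One caveat with this naive reverse mapping: a tag whose own name contains a double underscore would be mistaken for a namespaced tag.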
The less data you want, the faster it is. With the bare minimum
(what's needed to convert the data into FASTA format), my
test set of SWISS-PROT 38 takes (estimated) 10 minutes. With
everything needed for the SProt data structure, it's slightly
under 30 minutes, which is about what the current code
requires.
(Times estimated by extrapolation of my smaller test set.)
Code is in a state of flux and not really worth others looking
at right now. I'll work on it more tomorrow. Hope to make
it available on Friday.
Andrew