[Biopython-dev] building a data object

Fri Dec 21 06:03:03 EST 2001

Bioperl has only one sequence record data object.  One of the points
behind Biopython's two parsing systems was to allow the building of
different objects without having to rewrite the parser as well.
(BioJava has a similar goal, but is more akin to the first Biopython
parser and not the Martel one.)

Take the example I gave in my previous post:

iterator = format.make_iterator("record")
for record in iterator.parseFile(open(filename), Builder()):
    do_something(record.document)

In this case, the 'Builder()' is an object which translates SAX events
from whichever format is given into a 'document' of whatever is
desired.  For example, it could be a

  Swissprot2SeqRecordBuilder
  GenBank2LightweightSeqBuilder
  ...
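
A minimal sketch of such a builder, using the standard library's
xml.sax handler interface as a stand-in for the events Martel emits.
The tag names ("record", "name", "sequence") and the LightweightSeq
class are assumptions made for illustration; they are not taken from
any real Biopython format definition.

```python
from xml.sax import handler, parseString


class LightweightSeq:
    """Hypothetical FASTA-style record: just a name and a sequence."""

    def __init__(self, name="", sequence=""):
        self.name = name
        self.sequence = sequence


class LightweightSeqBuilder(handler.ContentHandler):
    """Turns SAX events into a LightweightSeq stored on .document."""

    def startDocument(self):
        self.document = None
        self._field = None   # tag whose text is being collected
        self._text = []

    def startElement(self, tag, attrs):
        if tag == "record":
            self.document = LightweightSeq()
        elif tag in ("name", "sequence"):
            self._field = tag
            self._text = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._field is not None:
            self._text.append(content)

    def endElement(self, tag):
        if tag == self._field:
            setattr(self.document, tag, "".join(self._text).strip())
            self._field = None


builder = LightweightSeqBuilder()
parseString(
    b"<record><name>P12345</name><sequence>MKVLLA</sequence></record>",
    builder,
)
print(builder.document.name, builder.document.sequence)
```

A builder for a different target object would listen to the same
events and only change what it stores on .document.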

Basically, there are two free variables -- input file type and object
to make.  So this needs some sort of double dispatch mechanism.

(That's not strictly true.  A GenBank specific data type may only
support being built from a GenBank record.  For example, a GenBank
record to HTML converter need only support GenBank.)
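
One possible shape for that dispatch is a table keyed on both
variables.  This is only a sketch: the registry, register_builder,
the string format names, and the empty placeholder classes are all
hypothetical.

```python
# Hypothetical registry: (format name, class to build) -> builder class.
_BUILDERS = {}


def register_builder(format_name, class_to_build, builder_class):
    _BUILDERS[(format_name, class_to_build)] = builder_class


def figure_out_builder(format_name, class_to_build):
    """Look up a builder on both the input format and the output class."""
    try:
        return _BUILDERS[(format_name, class_to_build)]
    except KeyError:
        raise LookupError(
            "no builder makes %s from %r"
            % (class_to_build.__name__, format_name)
        ) from None


# Placeholder output classes and builders, for illustration only.
class SeqRecord: pass
class GenBankHTML: pass
class Swissprot2SeqRecordBuilder: pass
class GenBank2SeqRecordBuilder: pass
class GenBank2HTMLBuilder: pass


register_builder("swissprot", SeqRecord, Swissprot2SeqRecordBuilder)
register_builder("genbank", SeqRecord, GenBank2SeqRecordBuilder)
# A format-specific output type registers for its one format only.
register_builder("genbank", GenBankHTML, GenBank2HTMLBuilder)
```

Asking for a pair that was never registered, such as GenBankHTML from
"swissprot", raises LookupError instead of guessing.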

Because of the combinatorial explosion, there won't be all that many
generalized intermediate formats.  I can think of perhaps four:
   - a "standard" sequence record
   - a "lightweight" sequence record, when FASTA-style data is enough
       (If the tag names and semantics are consistent across the
        different formats, this can be nearly trivial.)
   - an alignment record
   - some sort of structure data type.

Since there is (will be) format detection, there needs to be some way
to determine the right builder given only the requested output type.
The implementation is something like this:

def readFile(class_to_build, infile):
    # infile is an already-open file object
    format = set_of_allowed_possibilities.recognizeFile(infile)
    iterator = format.make_iterator("record")
    Builder = figure_out_builder(format, class_to_build)
    for record in iterator.parseFile(infile, Builder()):
        yield record.document

so someone can say

from Bio import SeqRecord, IO

for record in IO.readFile(SeqRecord.SeqRecord, open("unknown.dat")):
   do_something(record)

(There should also be a readString, for symmetry with the XML code in
Martel.)

I think the best way to implement 'figure_out_builder' is to
ask the class for it, perhaps via a static class method.

   class_to_build.get_builder(format)

This in turn requires either a registration system or some way to
determine the builder's location as a module.

(e.g., the Builder to convert a "swissprot/version=38" format into
a SeqRecord could be returned by calling
  Bio.bioformats.swissprot.SeqRecord.get_builder({"version": "38"})
)
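
A sketch of the registration variant, written with the class-method
support available in modern Python.  The Buildable base class,
register_builder, and the use of plain strings for formats are
assumptions for illustration; the post itself passes a format object.

```python
class Buildable:
    """Hypothetical base: each subclass keeps its own builder table."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls._builders = {}   # format name -> builder class

    @classmethod
    def register_builder(cls, format_name, builder_class):
        cls._builders[format_name] = builder_class

    @classmethod
    def get_builder(cls, format_name):
        try:
            return cls._builders[format_name]
        except KeyError:
            raise LookupError(
                "%s cannot be built from %r" % (cls.__name__, format_name)
            ) from None


class SeqRecord(Buildable):
    pass


class LightweightSeq(Buildable):
    pass


class Swissprot2SeqRecordBuilder:
    """Placeholder for a real SAX-event builder."""


SeqRecord.register_builder("swissprot", Swissprot2SeqRecordBuilder)
Builder = SeqRecord.get_builder("swissprot")
```

Because the table lives on each class, registering a builder for
SeqRecord does not make one available for LightweightSeq.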

Another way to do the API is to make 'readFile' a static method
of the SeqRecord object.  This gets rid of the 'IO' module.

from Bio import SeqRecord
for record in SeqRecord.SeqRecord.readFile(open("unknown.dat")):
   do_something(record)

This looks funny to me, especially since Python before 2.2 doesn't
really have static methods.  Python 2.2 makes them easier to write,
through the new staticmethod and classmethod built-ins.  A third
option is to use a function in the module namespace, as in

from Bio import SeqRecord
for record in SeqRecord.readFile(open("unknown.dat")):
   do_something(record)

This is probably the most traditional and appropriate solution.  On
the other hand, the functionality can't be added automatically through
inheritance, so every module author has to remember to add it.  Each
module will need to create the function explicitly, as in

from Bio import IO

readFile = IO.ReadFile(SeqRecord)
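
A sketch of what such a ReadFile could look like: a callable object
that remembers which class to build.  The parsing here is a toy
stand-in (one record per non-blank line) for the real format
detection and builder lookup, and this SeqRecord is a placeholder
class, not Biopython's.

```python
from io import StringIO


class ReadFile:
    """Callable bound to one output class."""

    def __init__(self, class_to_build):
        self.class_to_build = class_to_build

    def __call__(self, infile):
        # Toy stand-in for: recognize the format, pick a builder,
        # and iterate over the parsed records.
        for line in infile:
            text = line.strip()
            if text:
                yield self.class_to_build(text)


class SeqRecord:
    """Placeholder record type holding one line of text."""

    def __init__(self, text):
        self.text = text


# The explicit per-module step described above.
readFile = ReadFile(SeqRecord)

records = list(readFile(StringIO("ACGT\nTTGA\n")))
```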

Expanding even further, perhaps there should be an "io" object, with
this and the write methods (next email):

from Bio import SeqRecord
for record in SeqRecord.io.readFile(open("unknown.dat")):
   do_something(record)
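
One way an "io" attribute might be made to follow inheritance is a
descriptor that binds itself to whichever class it is looked up on.
This is a sketch under assumptions: IOAccessor, _BoundIO, and the
line-per-record parsing are hypothetical, and "io" hangs off a class
here rather than a module as in the example above.

```python
from io import StringIO


class _BoundIO:
    """Read helpers bound to one output class."""

    def __init__(self, class_to_build):
        self.class_to_build = class_to_build

    def readFile(self, infile):
        # Toy stand-in for format detection and builder dispatch.
        for line in infile:
            text = line.strip()
            if text:
                yield self.class_to_build(text)


class IOAccessor:
    """Descriptor: SomeClass.io returns helpers bound to SomeClass."""

    def __get__(self, instance, owner):
        return _BoundIO(owner)


class Record:
    io = IOAccessor()

    def __init__(self, text):
        self.text = text


class SeqRecord(Record):
    pass   # inherits .io with no extra code


records = list(SeqRecord.io.readFile(StringIO("ACGT\n\nTTGA\n")))
```

The subclass gets a working SeqRecord.io.readFile without any
explicit per-class setup, which is the part a plain module-level
function cannot provide.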

My problem is that I know this is a double dispatch problem, but I
don't know the right way to solve it.  I can think of many - perhaps
too many. :(

                                Andrew
                                dalke at dalkescientific.com