[Biopython-dev] Bioformat module

Sat Jan 5 10:24:29 EST 2002

Jeff:
>Hmmm...  That would be:
>Bio.Bioformats              vs    Bio.Formats
>Bio.Bioformats.Format       vs    Bio.Formats.Format
>Bio.Bioformats.formats      vs    Bio.Formats.formats
>
>Actually, I would favor a refactoring that would put end-user modules
>like IO, Writer, Registry right under Bio.  This would be consistent
>with the idea discussed in the last BOSC of having a wider tree to
>make it easier for people to find things.

Very good point, and I had forgotten about that discussion.
Okay, I did a somewhat hybrid solution and put a 'Format'
in front of a couple names but otherwise I merged the two
trees together.  The new modules are:

Format -- information about a format
FormatRegistry  -- knows how to use the format information
FormatIO -- knows how to use the format registry
Std -- defines standard XML tags
Dispatch -- a set of classes to make it easier to mix and
    match handlers
StdHandler -- a set of standard handlers, which use the
    standard tags to build portions of the data
ReseekFile -- helps read from files which don't allow
    reseeking to the beginning of the file
_FmtUtils -- internal support modules
Writer -- base class for the output writers

formatdefs -- high-level description of the formats
expressions -- low-level Martel expressions for the formats
builders -- makes data structures from Martel events
writers -- turns data structures into output

__init__ contains 'formats', which is an instance of a
   FormatRegistry.  It reads the 'formatdefs' directory to
   get the configuration information.
SeqRecord contains an 'io' object, which is an instance
   of FormatIO.

As it is right now, the format support is rather weak.  There
are two input formats -- swissprot/38 and an embl variation.
There is one output format, FASTA.

Here's an example of use

>>> filename = "/home/sac/bioperl-live-sac/t/data/roa1.swiss"
>>> from Bio import SeqRecord
>>> for record in SeqRecord.io.readFile(open(filename)):
...     print record.id
...     print record.seq
...
ROA1_HUMAN
SKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKR.....
>>>

Here's another (the description really is on a single line)

>>> SeqRecord.io.convert(open(filename))
>ROA1_HUMAN HETEROGENEOUS NUCLEAR RIBONUCLEOPROTEIN A1 (HELIX-
DESTABILIZING PROTEIN) (SINGLE-STRAND BINDING PROTEIN) (HNRNP
 CORE PROTEIN A1).
SKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVTYATVEEVDAAMN
ARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHLRDYFEQYGKIEVIEIMTDRGSGKK
RGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKALSKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFG
RGGNFSGRGGFGGSRGGGGYGGSGDGYNGFGNDGGYGGGGPGYSGGSRGYGSGGQGYGNQGSGYGGSGSYDS
YNNGGGRGFGGGSGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSS
SSSSYGSGRRF
>>>

Briefly, here's what's going on:
1) format detection
   The 'formatdefs' contain a description of the different
formats.  Some formats are really lists of other formats, in
a tree.  The tree structure looks like this:

sequence
  |- embl
  |- swissprot
  |     `- swissprot/38
  `- others

The SeqRecord.io contains the default reader format, which is
"sequence".  The sequence format tries each of its children.
Eventually, 'swissprot/38' works, which is returned as the
format definition.
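
To make the detection step concrete, here is a minimal sketch of
a tree that tries each child in order.  Everything in it is an
assumption for illustration: FormatNode, identify() and the toy
string tests are hypothetical names, not the Bio.Format API, and
the real code decides with the Martel grammar instead of
substring checks.

```python
# Hypothetical sketch of tree-based format detection.
# FormatNode and identify() are illustrative names only.

class FormatNode:
    def __init__(self, name, matches=None, children=()):
        self.name = name
        self.matches = matches          # callable(text) -> bool, or None
        self.children = list(children)  # sub-formats, tried in order

    def identify(self, text):
        """Return the first leaf format that accepts text, else None."""
        if self.children:
            for child in self.children:
                found = child.identify(text)
                if found is not None:
                    return found
            return None
        if self.matches is not None and self.matches(text):
            return self
        return None


# Toy predicates standing in for real grammars (assumptions).
embl = FormatNode("embl", matches=lambda t: "BP." in t)
sprot38 = FormatNode("swissprot/38", matches=lambda t: "PRT;" in t)
swissprot = FormatNode("swissprot", children=[sprot38])
others = FormatNode("others")  # placeholder leaf, matches nothing
sequence = FormatNode("sequence", children=[embl, swissprot, others])

toy_text = "ID   TOY_ENTRY  PRT;\n"  # made-up input, not a real entry
result = sequence.identify(toy_text)
print(result.name)
```

Starting from "sequence" means the caller never has to name the
concrete format; the tree order decides which one is tried first.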

2) find the builder

The SeqRecord.io contains a canonical name for the data type.
In this case it's "SeqRecord".  The file format has its own
canonical name, which is "swissprot/38".  Both also have what
I call an abbrev name, which is a name that can be used in
the file system.  The format's abbrev name is 'sprot38'.
So the initial builder is found in

  Bio/builders/SeqRecord/sprot38.py

However, this doesn't exist.  Here's where the hierarchy comes
into play.  The hierarchy must be such that if Y is a child of
X then all the tags which are defined in X must have the same
meanings in Y.  In that way, the parser to build from X can
be used to build from Y.

In other words,
  Bio/builders/SeqRecord/swissprot.py
  Bio/builders/SeqRecord/sequence.py
if one exists, should be just as usable as .../sprot38.py

This reduces the O(NxN) problem to an O(N) problem.
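
The fallback lookup can be sketched as a walk from the detected
format toward the root.  This is only an illustration: the
PARENT and ABBREV tables and find_builder() are hypothetical,
and a set of path strings stands in for checking which builder
modules actually exist.  Only the three file names come from the
description above.

```python
# Hypothetical sketch of the builder lookup.

# child format -> parent format (None marks the root)
PARENT = {
    "swissprot/38": "swissprot",
    "swissprot": "sequence",
    "sequence": None,
}

# canonical name -> abbrev name usable in the file system
ABBREV = {
    "swissprot/38": "sprot38",
    "swissprot": "swissprot",
    "sequence": "sequence",
}


def find_builder(datatype, format_name, available):
    """Return the most specific builder path in available, or None."""
    current = format_name
    while current is not None:
        path = "Bio/builders/%s/%s.py" % (datatype, ABBREV[current])
        if path in available:
            return path
        current = PARENT[current]  # fall back to the parent format
    return None


# Only a generic 'sequence' builder exists in this toy setup.
available = {"Bio/builders/SeqRecord/sequence.py"}
print(find_builder("SeqRecord", "swissprot/38", available))
```

Because a child format keeps the tag meanings of its parent, the
first builder found on the way up is safe to use.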

The Bio.Std module defines standard tag names.

3) the format contains the Martel grammar, so once the builder
is found, the file can be parsed.  When a record is parsed,
the content handler (the builder) must end up with a
".document" property.  This is the object to use for a record.
It's also what the DOM objects use.  By using this convention
I know how to get the 'record' from the builders, to return
in the for loop.
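
Here is a rough sketch of the ".document" convention using the
standard xml.sax module in place of Martel.  The tag names
(record, id, seq), the ToyBuilder class and read_records() are
made up for this example, and it assumes the input has already
been split into one chunk per record.

```python
# Hypothetical sketch: a SAX content handler that leaves its
# result in .document once a record has been parsed.
import xml.sax


class ToyBuilder(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.document = None   # filled in when a record ends
        self._text = []
        self._fields = {}

    def startElement(self, name, attrs):
        self._text = []        # start collecting text for this tag
        if name == "record":
            self._fields = {}

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        if name in ("id", "seq"):
            self._fields[name] = "".join(self._text).strip()
        elif name == "record":
            self.document = (self._fields["id"], self._fields["seq"])


def read_records(chunks):
    """Parse one chunk per record and yield each builder's .document."""
    for chunk in chunks:
        builder = ToyBuilder()
        xml.sax.parseString(chunk, builder)
        yield builder.document


chunks = [b"<record><id>ROA1_HUMAN</id><seq>SKSES</seq></record>"]
for rec_id, seq in read_records(chunks):
    print(rec_id)
    print(seq)
```

The reader never needs to know what kind of object the builder
makes; it only has to fetch .document after each record.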

4) Output conversion is also done with canonical names.
In this case, the SeqRecord also defines a default output
format.  (If not found, it searches down the hierarchy tree
instead of up.)  Writers have the following protocol:
  writeHeader() -- usually does nothing
  write(record) -- write a record
  writeFooter() -- usually does nothing
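
A writer following this protocol might look like the sketch
below.  Only the three method names come from the protocol
above.  ToyFastaWriter, convert(), the dict-based record and the
72-column wrap (chosen to match the sample output earlier) are
assumptions for illustration, not the actual Bio.Writer code.

```python
# Hypothetical sketch of a writer using the three-method protocol.
import io


class ToyFastaWriter:
    def __init__(self, outfile, width=72):
        self.outfile = outfile
        self.width = width   # residues per output line (assumed)

    def writeHeader(self):
        pass                 # FASTA has no file-level header

    def write(self, record):
        title = "%s %s" % (record["id"], record["description"])
        self.outfile.write(">%s\n" % title)
        seq = record["seq"]
        for start in range(0, len(seq), self.width):
            self.outfile.write(seq[start:start + self.width] + "\n")

    def writeFooter(self):
        pass                 # nor a footer


def convert(records, writer):
    """Drive a writer: header once, each record, footer once."""
    writer.writeHeader()
    for record in records:
        writer.write(record)
    writer.writeFooter()


out = io.StringIO()
toy = {"id": "X", "description": "toy record", "seq": "A" * 100}
convert([toy], ToyFastaWriter(out))
print(out.getvalue(), end="")
```

Formats that do need a wrapper around all the records (an XML
root element, say) would put it in writeHeader and writeFooter.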

5) The Dispatch classes are designed to help with making
new data structures easily.  It's too complicated to explain
right now.

6) To add a new format definition:
  a) make sure you understand the hierarchy requirement
  b) take a look at the swissprot and embl expressions/, to see
      how to use the Std module to define tags.  (I need to
      think about 'style' for a while more.)
  c) edit the formatdefs directory to add the new format
      configuration.

Okay, I can't go any further.  I still need to pack for my
trip.  Be back next week.

Enjoy!

                    Andrew
                    dalke at dalkescientific.com