[Biopython-dev] format autodetection

Andrew Dalke adalke at mindspring.com
Fri Dec 21 06:02:17 EST 2001

Hey all,

I'm getting back to working on Biopython.  I want to spend some time
on the file parsing code.  (Like, duh! :) The topics I want to work on
next include:

  - automatic file identification
  - iterating through records in a file
  - support for different record types
  - converting/writing records to a given format

I'll send an email for each point, starting now.

I have some ideas on file identification.  In theory, Martel could be
used by just |'ing the terms, except that:
  - some files may be parsable by multiple formats
  - a Martel definition parses the whole file, when file type
     identification need only parse part of the file
  - it's a linear search

What I'm toying around with is something like this:

def _recognizeFile(parser, infile):
    pos = infile.tell()
    err_h = ... something which can distinguish between a bad
        parse, and a successful one where unparsed text remains
        (I'm changing Martel to distinguish the two.)
    parser.setErrorHandler(err_h)
    try:
        parser.parseFile(infile)
    except Martel.Parser.ParserError:
        pass
    infile.seek(pos)   # put the file back where it was for the next try
    return err_h.successful_parse

class Format:
  def __init__(self, format_name, expression, recognize_expression = None,
               provider_url = None, documentation_url = None,
               description, short_description, maintainer...):
    if recognize_expression is None:
        recognize_expression = expression
    self.name = format_name
    self.expression = expression
    self.recognize_expression = recognize_expression
  def recognizeFile(self, infile):
    if _recognizeFile(self.recognize_expression.make_parser(), infile):
        return self
    return None

class RecognizeFormats:
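  def __init__(self, recognize_expression, formats = None):
    self.recognize_expression = recognize_expression
    self.formats = formats or []
  def recognizeFile(self, infile):
    # A recognize_expression of None means "no prefilter; try every format"
    if (self.recognize_expression is None or
        _recognizeFile(self.recognize_expression.make_parser(), infile)):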
        for format in self.formats:
            x = format.recognizeFile(infile)
            if x is not None:
                return x
    return None

This makes it possible to say

  from bioformats import swissprot
  swissprot38 = Format("swissprot/version=38",
                       expression = swissprot.swissprot38.format,
                       recognize_expression = swissprot.swissprot38.record)
  swissprot39 = Format("swissprot/version=39",
                       expression = swissprot.swissprot39.format,
                       recognize_expression = swissprot.swissprot39.record)
  swissprot40 = Format("swissprot/version=40",
                       expression = swissprot.swissprot40.format,
                       recognize_expression = swissprot.swissprot40.record)
  swissprot = RecognizeFormats(
                Martel.Str("ID  ") + Martel.ToEol() + \
                Martel.Str("AC  ") + Martel.ToEol(),
                [swissprot40, swissprot39, swissprot38])

  swissprot_like = RecognizeFormats(
                Martel.Re(r"[^ ][^ ]   "),
                [swissprot, ipi, ...])

  # This has GenBank records in a row, with no header
  genbank_records = Format("genbank", ...)
  # This has the header for the Genbank release
  genbank_release = Format("genbank-release", ...)

  genbank = RecognizeFormats(None, [genbank_records, genbank_release])

  # Not saying this is the best prefilter
  pdb = RecognizeFormats(Martel.Re("ATOM  |HETATM|HEADER"),
                         [many variations])

  sequence_format = RecognizeFormats(None,
                       [swissprot_like, genbank, pdb, ...])

  structure_format = RecognizeFormats(None, [pdb, mdl, ...])

  any = RecognizeFormats(None, [sequence_format, alignment_format,
                                structure_format])

The result can be used like this:

  format = sequence_format.recognizeFile(open("unknown.file"))
  print "It's a", format.name

I've tried this out.  It works.  Given a file or string, I can get a
Format definition which (claims to) parse it.
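
To make the shape of this concrete without Martel, here is a rough
sketch in current Python 3 that uses plain `re` patterns as stand-ins
for Martel expressions.  The patterns, the 4096-character read limit,
and the `recognize_text` helper are assumptions made up for
illustration; this is not the code described above.

```python
import io
import re

def _recognize(pattern, infile):
    """Return True if `pattern` matches at the current file position.

    Only a bounded prefix is read, and the position is always restored.
    """
    pos = infile.tell()
    try:
        head = infile.read(4096)   # arbitrary limit chosen for this sketch
    finally:
        infile.seek(pos)
    return pattern.match(head) is not None

class Format:
    def __init__(self, name, recognize_pattern):
        self.name = name
        self.recognize_pattern = re.compile(recognize_pattern)

    def recognizeFile(self, infile):
        return self if _recognize(self.recognize_pattern, infile) else None

class RecognizeFormats:
    def __init__(self, recognize_pattern, formats=None):
        # None means "no prefilter": always try the children.
        self.recognize_pattern = (
            None if recognize_pattern is None
            else re.compile(recognize_pattern))
        self.formats = list(formats or [])

    def recognizeFile(self, infile):
        if (self.recognize_pattern is not None
                and not _recognize(self.recognize_pattern, infile)):
            return None   # the prefilter rejects this whole subtree
        for fmt in self.formats:
            found = fmt.recognizeFile(infile)
            if found is not None:
                return found
        return None

# Toy patterns, invented for this example only.
fasta = Format("fasta", r">")
swissprot_like = RecognizeFormats(
    r"[A-Z][A-Z]   ",
    [Format("swissprot", r"ID   .*\nAC   ")])
pdb = RecognizeFormats(r"ATOM  |HETATM|HEADER", [Format("pdb", r"HEADER")])
sequence_format = RecognizeFormats(None, [swissprot_like, fasta, pdb])

def recognize_text(text):
    """Convenience helper: return the recognized format name, or None."""
    found = sequence_format.recognizeFile(io.StringIO(text))
    return None if found is None else found.name
```

The prefilter on each RecognizeFormats node is what avoids the linear
search: a file that fails the cheap test never reaches the formats
underneath it.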

There are several things I haven't figured out:

1) How are the formats named?  I made up "swissprot/version=38".  Is
the version attribute enough?  If there are other attributes, is there
a canonical ordering of attributes?

2) Does the word "recognize" make sense in this context?  I tried
"identifier" but that's also a commonly used noun.  (I chose
"recognize" from a post of Thomas's from the end of summer.)

3) Is information about the intermediate nodes in the tree useful?

4) How are new formats registered?  Manually?  Or is there a way to
autoadd them by dropping files in appropriately designated directories?

5) The top-level definitions require all the lower-level definitions
to be available.  If there are 50 formats, that might take a while.
There needs to be some way to defer loading modules until the parent
RecognizeFormats class is asked to recognize something.

6) Version detection depends on tell/seek working.  There needs to be
a simple wrapper for inputs (like URLs, and sys.stdin) which don't
support that action.  Jeff added something like this already.

7) What do I do with the format definition once I have it?

8) Does this idea make sense to others?
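
On point 5, one way deferred loading could look is a small proxy that
imports a format's module only when it is first asked to recognize
something.  This is a sketch in current Python 3; `LazyFormat`, its
`resolve` method, and the module/attribute naming scheme are
hypothetical and not part of Biopython or Martel.

```python
import importlib

class LazyFormat:
    """Stand-in for a format whose module is imported only on first use."""

    def __init__(self, name, module_name, attr_name):
        self.name = name
        self._module_name = module_name   # e.g. a dotted module path
        self._attr_name = attr_name       # object to fetch from the module
        self._target = None               # filled in by the first lookup

    @property
    def loaded(self):
        return self._target is not None

    def resolve(self):
        # Import the module and fetch the real object once, then cache it.
        if self._target is None:
            module = importlib.import_module(self._module_name)
            self._target = getattr(module, self._attr_name)
        return self._target

    def recognizeFile(self, infile):
        # Delegate to the real object, importing its module if needed.
        return self.resolve().recognizeFile(infile)
```

A parent RecognizeFormats could then hold a list of such proxies, so
building the top-level tree costs nothing until a prefilter actually
lets a file through to that branch.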

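On point 6, a wrapper along these lines would give forward-only inputs
a working tell/seek by remembering what has been read.  This is a
sketch in current Python 3 under the assumption that the input is a
text stream with a `read` method; `SeekableWrapper` is a made-up name
and is not the wrapper Jeff added.

```python
import io

class SeekableWrapper:
    """Buffer a forward-only text stream so tell()/seek() work.

    Everything read so far is kept in memory, so this suits format
    detection on the head of an input, not rewinding huge files.
    """

    def __init__(self, stream):
        self._stream = stream
        self._buffer = ""   # all text read from the stream so far
        self._pos = 0       # current position within the buffer

    def tell(self):
        return self._pos

    def seek(self, pos):
        if not 0 <= pos <= len(self._buffer):
            raise ValueError("can only seek within text already read")
        self._pos = pos

    def read(self, size=-1):
        if size is None or size < 0:
            # Read everything that is left in the underlying stream.
            self._buffer += self._stream.read()
            end = len(self._buffer)
        else:
            end = self._pos + size
            missing = end - len(self._buffer)
            if missing > 0:
                self._buffer += self._stream.read(missing)
            end = min(end, len(self._buffer))
        data = self._buffer[self._pos:end]
        self._pos = end
        return data

# io.StringIO stands in here for a forward-only stream such as sys.stdin.
wrapped = SeekableWrapper(io.StringIO("ID   TEST\nAC   X\n"))
start = wrapped.tell()
head = wrapped.read(5)   # peek at the first five characters
wrapped.seek(start)      # rewind so a real parser sees the whole input
```
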
                                   dalke at dalkescientific.com
