[Biopython-dev] format autodetection
Andrew Dalke
adalke at mindspring.com
Fri Dec 21 06:02:17 EST 2001
Hey all,
I'm getting back to working on Biopython. I want to spend some time
on the file parsing code. (Like, duh! :) The topics I want to work on
next include:
- automatic file identification
- iterating through records in a file
- support for different record types
- converting/writing records to a given format
I'll send an email for each point, starting now.
I have some ideas on file identification. In theory, Martel could be
used by just |'ing the terms, except that:
- some files may be parsable by multiple formats
- a Martel definition parses the whole file, when file type
  identification need only parse part of the file
- it's a linear search
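To make the first and third points concrete, here is a small sketch that uses Python's re module as a stand-in for Martel (it is not Martel code, and it is written for a current Python 3). The signature patterns and the identify() helper are made up for illustration. Both Swiss-Prot and EMBL files start with an "ID" line, so a plain alternation tries the branches in order and can only ever report the first one that matches.

```python
import re

# Hypothetical, deliberately loose signatures; real formats need far more.
SIGNATURES = [
    ("swissprot", r"ID   \S+"),
    ("embl", r"ID   \S+"),            # EMBL also starts with an ID line
    ("genbank", r"LOCUS       \S+"),
]

# One big alternation: each branch is a named group, tried left to right.
combined = re.compile(
    "|".join("(?P<%s>%s)" % (name, pattern) for name, pattern in SIGNATURES)
)

def identify(text):
    """Return the name of the first alternation branch that matches, or None."""
    m = combined.match(text)
    return m.lastgroup if m else None

print(identify("ID   100K_RAT     STANDARD;      PRT;   889 AA.\n"))
print(identify("LOCUS       AB000001\n"))
```

The first call reports "swissprot" and never gets a chance to say "embl", which is the ambiguity problem; the second has to fall through the earlier branches before reaching "genbank", which is the linear search.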
What I'm toying around with is something like this:
    def _recognizeFile(parser, infile):
        # Remember the starting position so the file is left where we found it.
        pos = infile.tell()
        err_h = ...  # something which can distinguish between a bad
                     # parse and a successful one where unparsed text
                     # remains (I'm changing Martel to distinguish the two)
        parser.setErrorHandler(err_h)
        try:
            try:
                parser.parseFile(infile)
            except Martel.Parser.ParserError:
                pass
        finally:
            infile.seek(pos)
        return err_h.successful_parse
    class Format:
        def __init__(self, format_name, expression, recognize_expression = None,
                     provider_url = None, documentation_url = None,
                     description = None, short_description = None,
                     maintainer = None, ...):
            # By default, recognize a file by parsing it with the full expression.
            if recognize_expression is None:
                recognize_expression = expression
            self.expression = expression
            ...
        def recognizeFile(self, infile):
            if _recognizeFile(self.recognize_expression.make_parser(), infile):
                return self
            return None
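    class RecognizeFormats:
        def __init__(self, recognize_expression, formats = None):
            ...
        def recognizeFile(self, infile):
            # A recognize_expression of None means there is no prefilter,
            # so go straight to trying each child format in order.
            if (self.recognize_expression is None or
                _recognizeFile(self.recognize_expression.make_parser(), infile)):
                for format in self.formats:
                    x = format.recognizeFile(infile)
                    if x is not None:
                        return x
            return None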
This makes it possible to say
    from bioformats import swissprot
    swissprot38 = Format("swissprot/version=38",
                         expression = swissprot.swissprot38.format,
                         recognize_expression = swissprot.swissprot38.record)
    swissprot39 = Format("swissprot/version=39",
                         expression = swissprot.swissprot39.format,
                         recognize_expression = swissprot.swissprot39.record)
    swissprot40 = Format("swissprot/version=40",
                         expression = swissprot.swissprot40.format,
                         recognize_expression = swissprot.swissprot40.record)
    swissprot = RecognizeFormats(
        Martel.Str("ID ") + Martel.ToEol() +
        Martel.Str("AC ") + Martel.ToEol(),
        [swissprot40, swissprot39, swissprot38])
    swissprot_like = RecognizeFormats(
        Martel.Re(r"[^ ][^ ] "),
        [swissprot, ipi, ...])
    # This has GenBank records in a row, with no header
    genbank_records = Format("genbank", ...)
    # This has the header for the GenBank release
    genbank_release = Format("genbank-release", ...)
    genbank = RecognizeFormats(None, [genbank_records, genbank_release])
    # Not saying this is the best prefilter
    pdb = RecognizeFormats(Martel.Re("ATOM |HETATM|HEADER"),
                           [many variations])
    sequence_format = RecognizeFormats(None,
                                       [swissprot_like, genbank, pdb, ...])
    structure_format = RecognizeFormats(None, [pdb, mdl, ...])
    any = RecognizeFormats(None, [sequence_format, alignment_format,
                                  structure_format])
The result can be used like this:
    format = sequence_format.recognizeFile(open("unknown.file"))
    print "It's a", format.name
I've tried this out. It works. Given a file or string, I can get a
Format definition which (claims to) parse it.
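For anyone who wants to play with the idea without Martel installed, here is a self-contained sketch of the same tree of recognizers. It is an approximation under stated assumptions, not the code I ran: regular expressions from Python's re module stand in for Martel expressions, the patterns are toy ones rather than real format definitions, the 1000-character prefix is an arbitrary choice, and it targets a current Python 3. The class and method names follow the ones above; _recognize and recognize_pattern are made up for this sketch.

```python
import io
import re

def _recognize(pattern, infile):
    """Check the text at the current position against `pattern`, then
    restore the file position (the tell/seek idea from _recognizeFile)."""
    pos = infile.tell()
    try:
        head = infile.read(1000)   # assumption: a prefix is enough to identify
    finally:
        infile.seek(pos)
    return re.match(pattern, head) is not None

class Format:
    def __init__(self, name, recognize_pattern):
        self.name = name
        self.recognize_pattern = recognize_pattern

    def recognizeFile(self, infile):
        return self if _recognize(self.recognize_pattern, infile) else None

class RecognizeFormats:
    def __init__(self, recognize_pattern, formats):
        self.recognize_pattern = recognize_pattern   # None means no prefilter
        self.formats = formats

    def recognizeFile(self, infile):
        if (self.recognize_pattern is None
                or _recognize(self.recognize_pattern, infile)):
            # The prefilter passed, so try each child in order.
            for fmt in self.formats:
                found = fmt.recognizeFile(infile)
                if found is not None:
                    return found
        return None

# Toy patterns, not real format definitions.
swissprot_like = RecognizeFormats(r"[A-Z]{2}   ", [
    Format("swissprot", r"ID   .*\nAC   "),
])
genbank = RecognizeFormats(None, [Format("genbank", r"LOCUS       ")])
sequence_format = RecognizeFormats(None, [swissprot_like, genbank])

infile = io.StringIO("ID   100K_RAT\nAC   Q62671;\n")
fmt = sequence_format.recognizeFile(infile)
print(fmt.name, infile.tell())
```

The file position is back at the start after recognition, so the same handle can be passed straight to whichever parser the recognized format provides.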
There are several things I haven't figured out:
1) How are the formats named? I made up "swissprot/version=38". Is
the version attribute enough? If there are other attributes, is there
a canonical ordering of attributes?
2) Does the word "recognize" make sense in this context? I tried
"identifier" but that's also a commonly used noun. (I chose
"recognize" from a post of Thomas's from the end of summer.)
3) Is information about the intermediate nodes in the tree useful?
4) How are new formats registered? Manually? Or is there a way to
autoadd them by dropping files in appropriately designated
directories?
5) The top-level definitions require all the lower-level definitions
to be available. If there are 50 formats, that might take a while.
There needs to be some way to defer loading modules until the parent
RecognizeFormats class is asked to recognize something.
6) Version detection depends on tell/seek working. There needs to be
a simple wrapper for inputs (like URLs, and sys.stdin) which don't
support that action. Jeff added something like this already.
7) What do I do with the format definition once I have it?
8) Does this idea make sense to others?
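On point 6, the kind of wrapper I have in mind could look roughly like the sketch below. This is only an illustration, not Jeff's code: the SeekableWrapper and OneWay names are made up, it is written for a current Python 3, and it assumes a text stream that offers nothing but read(). It keeps everything read so far in memory, so seeking back only works within data that has already been read.

```python
import io

class SeekableWrapper:
    """Hypothetical wrapper: remembers everything read so far so that
    tell() and seek() back to an earlier position work on a one-way stream."""

    def __init__(self, stream):
        self._stream = stream
        self._buffer = ""   # all text read from the stream so far
        self._pos = 0       # current logical position within _buffer

    def tell(self):
        return self._pos

    def seek(self, pos):
        if not 0 <= pos <= len(self._buffer):
            raise ValueError("can only seek within data already read")
        self._pos = pos

    def read(self, size=-1):
        if size < 0:
            # Read everything that is left in the underlying stream.
            self._buffer += self._stream.read()
            end = len(self._buffer)
        else:
            end = self._pos + size
            missing = end - len(self._buffer)
            if missing > 0:
                self._buffer += self._stream.read(missing)
            end = min(end, len(self._buffer))
        data = self._buffer[self._pos:end]
        self._pos = end
        return data

class OneWay:
    """Stand-in for sys.stdin or a URL response: read() only."""
    def __init__(self, text):
        self._inner = io.StringIO(text)

    def read(self, size=-1):
        return self._inner.read(size)

handle = SeekableWrapper(OneWay("ID   100K_RAT\nAC   Q62671;\n"))
pos = handle.tell()
first = handle.read(5)    # the recognizer peeks at the start
handle.seek(pos)          # ... and rewinds
again = handle.read(5)    # the real parser sees the same text
```

The obvious cost is memory: the buffer grows with everything read, so a real version would want a way to discard data once the format has been recognized.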
Andrew
dalke at dalkescientific.com