[Biopython-dev] record iteration
Andrew Dalke
adalke at mindspring.com
Fri Dec 21 06:02:26 EST 2001
Most data files are of this form:
  <dataset>
    <header>...</header>   (optional)
    <record>...</record>   (one or more)
    <footer>...</footer>   (optional)
  </dataset>
Nearly everyone only wants to read the records from this file, using a
mechanism like this:
    for record in file:
        do_something(record)
and doesn't care about the header and footer information. In Martel
this can be done by passing in the tag name of the record boundary to
the make_iterator method.
    iterator = format.make_iterator("record")
    for record in iterator.parseFile(open(filename), Builder()):
        do_something(record.document)
If we standardize on the tag name of "record" then this will work for
everything.
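
To show the idea outside Martel, here is a minimal sketch that uses only
Python's standard xml.etree.ElementTree. It assumes the data is already
available as XML in the dataset/header/record/footer layout above. The
function name iter_records and the sample document are made up for
illustration and are not part of Martel.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag="record"):
    """Yield each <tag> element from an XML stream, skipping everything else.

    Hypothetical helper for illustration; this is not Martel's API.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Runs when the caller asks for the next record, so the
            # previous one can be released.
            elem.clear()


# Invented sample data in the layout described above.
SAMPLE = b"""<dataset>
<header>release notes</header>
<record>first</record>
<record>second</record>
<footer>end of file</footer>
</dataset>"""

texts = [record.text for record in iter_records(io.BytesIO(SAMPLE))]
print(texts)
```

Because the loop only looks at the tag name, the same code would work for
any format whose grammar uses "record" as the record boundary.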
The existing formats I wrote do not use this standard because, when I
wrote them, an element could carry only a tag name. They used names
like "swissprot38_record".
With the changes I made this summer, Martel grammars can include
attributes for the element, as in:
  <record format="swissprot" version="38">
    ...
  </record>
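
As a rough sketch of what this buys a consumer, the code below reads the
format and version from the attributes while still matching on the single
tag name "record". It uses Python's standard xml.etree.ElementTree rather
than Martel, and the second record's format name and the record contents
are invented placeholders.

```python
import xml.etree.ElementTree as ET

# Invented sample: two records that share a tag name but differ in attributes.
SAMPLE = """<dataset>
<record format="swissprot" version="38">record A</record>
<record format="other" version="1">record B</record>
</dataset>"""

root = ET.fromstring(SAMPLE)

# Every record has the same tag; the attributes say which format it came from.
found = [(r.get("format"), r.get("version"), r.text) for r in root.iter("record")]

# Selecting one format no longer needs a format-specific tag name.
swissprot_only = [r.text for r in root.iter("record") if r.get("format") == "swissprot"]

print(found)
print(swissprot_only)
```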
So my proposal is to standardize on certain tag names, to be shared
across all of the Biopython/Martel grammars. These include:
  dataset
  record
  header
  footer
and allow for a standard scaffold for parsing sequence records.
BTW, those standard tag names should also include
  primary_id
  description  (free-form text)
  sequence     (single letter codes)
  sequence3    (three letter code)
  xref         (cross reference to another database)
  ... others?
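
If those names were shared, format-independent code could pull the common
fields out of any record. The sketch below assumes the standard tags appear
somewhere inside each record element. That nesting, the helper name
summarize, and the sample values are all assumptions for illustration, and
the code uses Python's standard xml.etree.ElementTree rather than Martel.

```python
import xml.etree.ElementTree as ET

# The field names proposed above.
STANDARD_TAGS = ("primary_id", "description", "sequence", "sequence3", "xref")


def summarize(record):
    """Collect the text of each standard tag found anywhere inside a record.

    Hypothetical helper; values are lists because a tag such as xref
    may occur more than once.
    """
    summary = {}
    for tag in STANDARD_TAGS:
        values = ["".join(e.itertext()).strip() for e in record.iter(tag)]
        if values:
            summary[tag] = values
    return summary


# Invented sample record; identifiers and sequence are placeholders.
SAMPLE = """<record format="example" version="1">
<primary_id>ID0001</primary_id>
<description>hypothetical protein</description>
<sequence>MKV</sequence>
<xref>OTHERDB:42</xref>
<xref>OTHERDB:43</xref>
</record>"""

summary = summarize(ET.fromstring(SAMPLE))
print(summary)
```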
As we rework the format definitions, some of these will become apparent.
This starts getting into BioXML-type work.
Andrew
dalke at dalkescientific.com