[Biopython-dev] record iteration

Andrew Dalke adalke at mindspring.com
Fri Dec 21 06:02:26 EST 2001

Most data files are of this form:

<header>...</header>  (optional)
<record>...</record>  (one or more)
<footer>...</footer>  (optional)

Nearly everyone wants only to read the records from such a file, using a
mechanism like this:

for record in file:

and doesn't care about the header and footer information.  In Martel
this can be done by passing the tag name of the record boundary to
the make_iterator method.

iterator = format.make_iterator("record")
for record in iterator.parseFile(open(filename), Builder()):
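
The same idea can be sketched without Martel, using only the Python
standard library.  This is a rough illustration, not Martel's
implementation: it assumes the data is already available as XML with
the header/record/footer layout above, wrapped in a single root
element.  The function name iter_records, the <file> wrapper and the
sample data are all made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(handle, tag="record"):
    """Yield each element named `tag`, skipping header, footer, etc."""
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Drop the element's contents once the caller is done with it,
            # so a large file does not accumulate in memory.
            elem.clear()


# Hypothetical sample data: optional header, two records, optional footer.
SAMPLE = b"""<file>
<header>release notes</header>
<record><description>first</description></record>
<record><description>second</description></record>
<footer>end of data</footer>
</file>"""

descriptions = [
    rec.findtext("description") for rec in iter_records(io.BytesIO(SAMPLE))
]
print(descriptions)
```

Because the tag name is a parameter with a shared default, the calling
code does not need to know which format it is reading.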

If we standardize on the tag name of "record" then this will work for

The existing formats I wrote do not use this standard because Martel
originally allowed only a tag name, so they had names like
"swissprot38_record".
With the changes I made this summer, Martel grammars can include
attributes for the element, as in:

<record format="swissprot" version="38">
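
A consumer could then key on the shared tag name and read the format
details from the attributes.  The sketch below assumes the events
arrive through the standard SAX ContentHandler interface; the class
name RecordFormatCollector and the <file> wrapper element are
hypothetical, and the input is just the element shown above.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class RecordFormatCollector(ContentHandler):
    """Collect (format, version) for every <record> start tag seen."""

    def __init__(self):
        super().__init__()
        self.formats = []

    def startElement(self, name, attrs):
        # One tag name for all formats; the specifics live in attributes.
        if name == "record":
            self.formats.append((attrs.get("format"), attrs.get("version")))


handler = RecordFormatCollector()
xml.sax.parseString(
    b'<file><record format="swissprot" version="38"></record></file>',
    handler,
)
print(handler.formats)
```

Code that does not care about the format can ignore the attributes
entirely and still find every record.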

So my proposal is to standardize on certain tag names, to be shared
across all of the Biopython/Martel grammars.  These include:

and allow for a standard scaffold for parsing sequence records.

BTW, those standard tag names should also include
  description (free-form text)
  sequence   (single letter codes)
  sequence3  (three letter code)
  xref       (cross reference to another database)
  ... others?

As we rework the format definitions, some of these will become apparent.
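
To show what shared tag names would buy, here is a small sketch of
format-independent code that pulls the standard fields out of one
record.  It uses the tag names proposed above; the helper name
summarize and the record contents are invented for illustration and do
not come from any real database entry.

```python
import xml.etree.ElementTree as ET

# The tag names proposed above.
STANDARD_TAGS = ("description", "sequence", "sequence3", "xref")


def summarize(record):
    """Map each standard tag name to the texts found under the record."""
    return {tag: [e.text for e in record.iter(tag)] for tag in STANDARD_TAGS}


# Hypothetical record; the field values are made up.
record = ET.fromstring(
    '<record format="swissprot" version="38">'
    "<description>hypothetical protein</description>"
    "<sequence>MKV</sequence>"
    "<xref>EMBL; X00000</xref>"
    "</record>"
)
summary = summarize(record)
print(summary)
```

The same function would work on a record from any grammar that follows
the naming convention, which is the point of standardizing.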

This starts getting into BioXML-type work.

                                   dalke at dalkescientific.com
