[Biopython-dev] record iteration

Andrew Dalke adalke at mindspring.com
Fri Dec 21 06:02:26 EST 2001


Most data files are of this form:

<dataset>
<header>...</header>  (optional)
<record>...</record>  (one or more)
<footer>...</footer>  (optional)
</dataset>


Nearly everyone wants only to read the records from such a file, using a
mechanism like this:

for record in file:
    do_something(record)

and doesn't care about the header and footer information.  In Martel
this can be done by passing the tag name of the record boundary to
the make_iterator method.

iterator = format.make_iterator("record")
for record in iterator.parseFile(open(filename), Builder()):
    do_something(record.document)

If we standardize on the tag name of "record" then this will work for
everything.
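
To illustrate why a shared tag name helps, here is a minimal sketch that
is independent of Martel.  It assumes the parsed data is available as XML
text using the tag names above, and it uses the standard library's
xml.etree.ElementTree rather than Martel's own iterator.  The function
name iter_records and the sample data are made up for this example.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(stream, tag="record"):
    """Yield each element named `tag`, ignoring header and footer."""
    # iterparse reports an "end" event once an element is fully read.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            yield elem
            elem.clear()  # drop the record's children once the caller is done


# Made-up sample data following the dataset/header/record/footer layout.
sample = io.BytesIO(
    b"<dataset>"
    b"<header>made-up header</header>"
    b"<record><primary_id>ID1</primary_id></record>"
    b"<record><primary_id>ID2</primary_id></record>"
    b"<footer>made-up footer</footer>"
    b"</dataset>"
)

ids = [record.findtext("primary_id") for record in iter_records(sample)]
print(ids)
```

Because the loop only tests for the name "record", the same function
works for any format that follows the convention.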

The existing formats I wrote do not follow this convention because an
element could only have a tag name, so the format and version went into
the name itself, as in "swissprot38_record".  With the changes I made
this summer, Martel grammars can include attributes for the element,
as in:

<record format="swissprot" version="38">
 ...
</record>
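
As a rough sketch of how a consumer might use those attributes, the
following reads the format and version from a generic record element.
It again assumes the output is handled as XML text with the standard
library and is not Martel's API.  The helper name describe is
hypothetical.

```python
import xml.etree.ElementTree as ET


def describe(record):
    """Return (format, version) from a generic <record> element."""
    # Attribute values are strings; a missing attribute comes back as None.
    return record.get("format"), record.get("version")


record = ET.fromstring('<record format="swissprot" version="38"></record>')
fmt, version = describe(record)
print(fmt, version)
```

Code that does not care about the format can ignore the attributes, and
code that does care can dispatch on them without parsing the tag name.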

So my proposal is to standardize on certain tag names, to be shared
across all of the Biopython/Martel grammars.  These include:
  dataset
  record
  header
  footer

This would allow for a standard scaffold for parsing sequence records.

BTW, those standard tag names should also include
  primary_id
  description (free-form text)
  sequence   (single letter codes)
  sequence3  (three letter codes)
  xref       (cross reference to another database)
  ... others?

As we rework the format definitions, some of these will become apparent.
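
With shared field names, one format-independent routine could pull the
common fields out of any record.  A minimal sketch, assuming the record
is available as an XML element whose children use the proposed names.
The function summarize, the choice of which tags are single-valued, and
the sample values are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: these appear at most once per record, while xref may repeat.
SINGLE_VALUED = ("primary_id", "description", "sequence")


def summarize(record):
    """Collect the proposed standard fields from one <record> element."""
    fields = {tag: record.findtext(tag) for tag in SINGLE_VALUED}
    fields["xref"] = [xref.text for xref in record.findall("xref")]
    return fields


# Made-up record; the identifiers do not refer to real database entries.
record = ET.fromstring(
    "<record>"
    "<primary_id>ID1</primary_id>"
    "<description>made-up example entry</description>"
    "<sequence>MKV</sequence>"
    "<xref>OTHERDB:X1</xref>"
    "<xref>OTHERDB:X2</xref>"
    "</record>"
)
summary = summarize(record)
print(summary["primary_id"], summary["xref"])
```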

This starts getting into BioXML-type work.

                                   Andrew
                                   dalke at dalkescientific.com




