[Biopython-dev] Martel now supports attributes

Andrew Dalke dalke at dalkescientific.com
Wed Jul 11 10:00:57 EDT 2001


I finally got a chance to do something I proposed a long time ago.
Martel now supports attributes for the XML events.

[dalke@pw600a biopython]$ python
Python 2.0 (#4, Dec  8 2000, 21:23:00)
[GCC egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> from Martel import *
>>> format = Group("seq", Re("[ATCG]+"), {"type": "dna"}) + AnyEol()
>>> from xml.sax import saxutils
>>> gen = saxutils.XMLGenerator()
>>> parser = format.make_parser()
>>> parser.setContentHandler(gen)
>>> parser.parseString("GATTACA\n")
<?xml version="1.0" encoding="iso-8859-1"?>
<seq type="dna">GATTACA</seq>
>>>

The new part here is the optional third argument to Martel.Group,
a dictionary of attributes for the element.  The result shows up
in the <seq> start tag, which now includes the attribute
'type="dna"'.

The regular expression pattern language was also modified so that
the attributes are stored in, and read back from, the (?P<...>)
group name.

>>> str(format)
'(?P<seq?type=dna>[ACGT]+)(\\n|\\r\\n?)'
>>>

The attributes are encoded like the query component of a URL,
so the following is allowed:

   (?P<spam?a=8&homedir=%7Edalke>...)

and corresponds to a startElement of:

   <spam a="8" homedir="~dalke">
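To make the encoding concrete, here is a minimal sketch of the
name/attribute round trip.  It is not Martel's own code: it uses
Python 3's urllib.parse, and the helper names encode_group_name and
decode_group_name are made up for illustration.

```python
from urllib.parse import parse_qsl, quote, urlencode


def encode_group_name(tag, attrs):
    """Build a 'tag?key=value&...' group name (hypothetical helper)."""
    if not attrs:
        return tag
    # quote_via=quote percent-encodes spaces as %20 rather than '+'.
    return tag + "?" + urlencode(attrs, quote_via=quote)


def decode_group_name(name):
    """Split a 'tag?key=value&...' group name into (tag, attrs)."""
    tag, _, query = name.partition("?")
    # parse_qsl undoes the percent-encoding, so %7E becomes '~'.
    return tag, dict(parse_qsl(query, keep_blank_values=True))


tag, attrs = decode_group_name("spam?a=8&homedir=%7Edalke")
print(tag, attrs)
print(encode_group_name("seq", {"type": "dna"}))
```

Decoding the group name gives the tag and the attribute dictionary
that would be passed along with the startElement event.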

The reason for this change is to make it easier to support different
formats and versions.  Until now I've been using tags like

<swissprot38><swissprot38_record> ...

Now I can do:

<seqdb dbname="swissprot" release="38">
  <record format="swissprot">
ID   <id type="id">100K_RAT</id>
AC   <id type="accession">Q12345</id>;  ...
SQ   <sequence>EKLADWERDN ADEDLE</sequence>
  </record>
</seqdb>

and if tag names are chosen consistently across the databases then
something like a FASTA conversion can be made very generic: just
get the <id type="id"> and <sequence> fields of each <record>.
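As a rough sketch of what such a generic converter could look like,
here is a SAX handler that keeps only the <id type="id"> and
<sequence> text of each <record>.  It assumes the tag and attribute
names from the example above and Python 3's xml.sax; the names
FastaHandler, to_fasta and SAMPLE are invented for illustration and
are not part of Martel.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class FastaHandler(ContentHandler):
    """Collect (id, sequence) pairs from <record> elements."""

    def __init__(self):
        super().__init__()
        self.records = []      # finished (id, sequence) pairs
        self._capture = None   # element name whose text is being kept
        self._text = []
        self._id = None
        self._seq = None

    def startElement(self, name, attrs):
        if name == "record":
            self._id, self._seq = None, None
        elif name == "id" and attrs.get("type") == "id":
            self._capture, self._text = "id", []
        elif name == "sequence":
            self._capture, self._text = "sequence", []

    def characters(self, content):
        if self._capture is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == self._capture:
            text = "".join(self._text)
            if name == "id":
                self._id = text.strip()
            else:
                # Drop the spaces and newlines used to lay out the sequence.
                self._seq = "".join(text.split())
            self._capture = None
        elif name == "record":
            self.records.append((self._id, self._seq))


def to_fasta(xml_text):
    """Return FASTA text for every <record> in the XML string."""
    handler = FastaHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return "".join(">%s\n%s\n" % (i, s) for i, s in handler.records)


SAMPLE = """<seqdb dbname="swissprot" release="38">
  <record format="swissprot">
ID   <id type="id">100K_RAT</id>
AC   <id type="accession">Q12345</id>;
SQ   <sequence>EKLADWERDN ADEDLE</sequence>
  </record>
</seqdb>"""

print(to_fasta(SAMPLE), end="")
```

Nothing in the handler mentions SWISS-PROT, so the same code should
work for any format whose tags follow the same naming.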

I've also added the old Martel-specific regression tests back to
the main biopython CVS tree.  This doesn't include the format-specific
tests (like those for PIR, BLAST, etc.), except for SWISS-PROT.

I renamed the 'sre_parse.py' and 'sre_constants.py' files to
'msre_parse.py' and 'msre_constants.py' because I was all too often
running into conflicts between those files and the ones in the
now-standard Python distribution.

                    Andrew
                    dalke at acm.org




