[Biopython-dev] Martel now supports attributes
Andrew Dalke
dalke at dalkescientific.com
Wed Jul 11 10:00:57 EDT 2001
I finally got a chance to do something I proposed a long time ago.
Martel now supports attributes for the XML events.
[dalke at pw600a biopython]$ python
Python 2.0 (#4, Dec 8 2000, 21:23:00)
[GCC egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> from Martel import *
>>> format = Group("seq", Re("[ATCG]+"), {"type": "dna"}) + AnyEol()
>>> from xml.sax import saxutils
>>> gen = saxutils.XMLGenerator()
>>> parser = format.make_parser()
>>> parser.setContentHandler(gen)
>>> parser.parseString("GATTACA\n")
<?xml version="1.0" encoding="iso-8859-1"?>
<seq type="dna">GATTACA</seq>
>>>
So the new part here is the optional 3rd arg to Martel.Group,
which is the dictionary to use for the attributes. The result
is shown in the <seq> tag, which now includes the attribute
'type="dna"'.
The regular expression pattern language was modified so that the
attributes can be stored in, and recovered from, the (?P<...>) group name.
>>> str(format)
'(?P<seq?type=dna>[ACGT]+)(\\n|\\r\\n?)'
>>>
The attributes are encoded like the query component of a URL,
so the following is allowed:
(?P<spam?a=8&homedir=%7Edalke>...)
and corresponds to a startElement of:
<spam a="8" homedir="~dalke">
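The encoding can be sketched with Python's standard urllib.parse module. This is only an illustration of the URL-query idea described above, not Martel's own code; the helper names split_group_name and join_group_name are made up for this example.

```python
from urllib.parse import parse_qsl, urlencode


def split_group_name(name):
    """Split 'tag?k=v&k2=v2' into (tag, {k: v, ...}).

    Hypothetical helper: shows the URL-query style decoding only.
    """
    tag, sep, query = name.partition("?")
    # parse_qsl undoes the %XX escapes, so %7E comes back as "~".
    attrs = dict(parse_qsl(query, keep_blank_values=True)) if sep else {}
    return tag, attrs


def join_group_name(tag, attrs):
    """Inverse of split_group_name: build 'tag?k=v&k2=v2'."""
    if not attrs:
        return tag
    return tag + "?" + urlencode(attrs)


tag, attrs = split_group_name("spam?a=8&homedir=%7Edalke")
print(tag, attrs)
```

Here tag is "spam" and attrs maps "a" to "8" and "homedir" to "~dalke", which matches the startElement shown above. A name without a "?" simply yields an empty attribute dictionary.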
The reason for this change is to make it easier to support different
formats and versions. Currently I've been using tags like
<swissprot38><swissprot38_record> ...
Now I can do:
<seqdb dbname="swissprot" release="38">
<record format="swissprot">
ID <id type="id">100K_RAT</id>
AC <id type="accession">Q12345</id>; ...
SQ <sequence>EKLADWERDN ADEDLE</sequence>
</record>
</seqdb>
and if tag names are chosen consistently across the databases then
something like a FASTA conversion can be made very generic - just
get the 'id type="id"' and <sequence> fields of each <record>.
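As a rough sketch of what such a generic conversion might look like, the SAX handler below pulls the 'id type="id"' and <sequence> text out of each <record> and writes FASTA. It uses only the standard xml.sax module and runs on already-generated XML rather than on a Martel parser; the class name FastaHandler, the function to_fasta, and the sample data (modeled on the example above) are assumptions made for illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class FastaHandler(ContentHandler):
    """Hypothetical handler: one FASTA entry per <record>."""

    def __init__(self, out):
        super().__init__()
        self.out = out
        self.field = None   # list currently collecting character data
        self.ident = []
        self.seq = []

    def startElement(self, name, attrs):
        if name == "record":
            self.ident, self.seq = [], []
        elif name == "id" and attrs.get("type") == "id":
            self.field = self.ident
        elif name == "sequence":
            self.field = self.seq

    def characters(self, content):
        if self.field is not None:
            self.field.append(content)

    def endElement(self, name):
        if name in ("id", "sequence"):
            self.field = None
        elif name == "record":
            ident = "".join(self.ident).strip()
            # Drop the spaces and newlines used to lay out the sequence.
            seq = "".join("".join(self.seq).split())
            self.out.write(">%s\n%s\n" % (ident, seq))


def to_fasta(xml_text):
    out = io.StringIO()
    xml.sax.parseString(xml_text.encode("utf-8"), FastaHandler(out))
    return out.getvalue()


# Sample data modeled on the message's example, not real database output.
SAMPLE = """<seqdb dbname="swissprot" release="38">
<record format="swissprot">
ID <id type="id">100K_RAT</id>
AC <id type="accession">Q12345</id>;
SQ <sequence>EKLADWERDN ADEDLE</sequence>
</record>
</seqdb>"""

print(to_fasta(SAMPLE), end="")
```

Because the handler looks only at tag names and the type attribute, the same code would apply to any format whose parser emits these names, which is the point of choosing them consistently.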
I've also added the old Martel-specific regression tests back to
the main biopython CVS tree. This doesn't include the format-specific
tests (like those for PIR, BLAST, etc.), except for SWISS-PROT.
I renamed the 'sre_parse.py' and 'sre_constants.py' files to
'msre_parse.py' and 'msre_constants.py' because I was all too often
running into conflicts between those files and the ones in the
now-standard Python distribution.
Andrew
dalke at acm.org