[Biopython-dev] Martel timings
Andrew Dalke
dalke at acm.org
Thu Oct 12 03:55:37 EDT 2000
I'm starting to compare Martel parsing with the existing biopython code.
I wrote a Martel document handler called SwissProtBuilder.py (attached)
which creates Bio.SwissProt.Record objects.
The output is comparable to that of the existing code, although the new
code is still wrong in places. (There are a couple of minor things I need
to fix in the grammar to make the parsing easy.)
The timings are also comparable. The biopython code is about 8% slower
than the Martel code. The Martel code takes about 25 minutes to parse
sprot38.
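For reference, the "about 8%" figure can be worked out from the timings quoted in the attached file's docstring (59.9 seconds for the Martel code and 65.1 seconds for the existing code, on the 200024-line subset). This is only a sketch of the arithmetic in present-day Python; the variable names are mine.

```python
martel_seconds = 59.9      # Martel-based builder, from the attached docstring
biopython_seconds = 65.1   # existing biopython parser, same test file

# How much longer the existing code takes, relative to the Martel code
slower_pct = (biopython_seconds / martel_seconds - 1.0) * 100.0

# How much time the Martel code saves, relative to the existing code
faster_pct = (1.0 - martel_seconds / biopython_seconds) * 100.0

print("biopython is %.1f%% slower; Martel is %.1f%% faster"
      % (slower_pct, faster_pct))
```

Either way of stating it rounds to roughly 8%, which is why "8% slower" and "8% faster" both appear.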
Because of the new RecordReader, it only needed about 4MB of memory.
I assume the biopython code is at least that good.
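The low memory use presumably comes from handing the parser one record at a time instead of the whole file. The sketch below is not Martel's RecordReader. It is a hypothetical generator, `iter_records`, written in present-day Python, and it assumes SWISS-PROT-style records that end with a line starting with `//`.

```python
import io

def iter_records(handle, terminator="//"):
    """Yield one record at a time as a string (hypothetical helper)."""
    lines = []
    for line in handle:
        lines.append(line)
        if line.startswith(terminator):
            # A complete record has been collected; hand it over and
            # start again so only one record is held in memory.
            yield "".join(lines)
            lines = []
    if lines:
        # Trailing data without a terminator; pass it along rather than drop it
        yield "".join(lines)

# Two tiny made-up records, just to exercise the generator
sample = io.StringIO("ID   A_HUMAN\nSQ   ...\n//\nID   B_HUMAN\nSQ   ...\n//\n")
records = list(iter_records(sample))
```

In real use the caller would loop over `iter_records(infile)` and parse each record as it arrives, rather than collecting them into a list as the example does.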
One of the reasons for the good performance on the Martel side is that
I'm pruning the expression tree to get rid of events which aren't handled
by the callback object. That eliminates a lot of function call overhead.
I also turned a long if/elif chain in endElement into a dispatch table,
which saved more time because of the conversion from an O(N) lookup to
O(1).
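The if/elif-to-dispatch change can be shown in isolation. The sketch below uses two made-up handler classes in present-day Python, not the attached SwissProtBuilder. The first compares the name against each branch in turn; the second finds a method named `end_<name>` with a single `getattr` call, which is a hash-based lookup and so roughly constant time.

```python
class ChainHandler:
    """Finds the handler by walking an if/elif chain: O(N) comparisons."""
    def __init__(self):
        self.seen = []

    def end_element(self, name):
        if name == "entry_name":
            self.seen.append("entry_name")
        elif name == "keyword":
            self.seen.append("keyword")
        elif name == "sequence":
            self.seen.append("sequence")


class DispatchHandler:
    """Finds a method named end_<name> with one attribute lookup."""
    def __init__(self):
        self.seen = []

    def end_element(self, name):
        method = getattr(self, "end_" + name, None)
        if method is not None:   # names without a handler are ignored
            method()

    def end_entry_name(self):
        self.seen.append("entry_name")

    def end_keyword(self):
        self.seen.append("keyword")

    def end_sequence(self):
        self.seen.append("sequence")


events = ["entry_name", "keyword", "unhandled", "sequence"]
chain, dispatch = ChainHandler(), DispatchHandler()
for name in events:
    chain.end_element(name)
    dispatch.end_element(name)
```

Both handlers record the same three events; the difference is only in how much work each lookup costs as the number of element names grows.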
It turns out there is a bug in the pruning code because the RecordReader
doesn't prune its children. It doesn't cause an error but just a slowdown,
so I didn't notice it until now. I've included a patch with this email
which brings Martel-0.3 up to my internal development version.
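To illustrate the kind of bug described here, consider a tree in which each node may emit a named event. This is not Martel's code: the `Node` class and the `prune` and `event_names` functions are hypothetical, written in present-day Python. Pruning drops the names nobody listens for, and it has to recurse through every node, or the subtree below a skipped node keeps generating unused events.

```python
class Node:
    """Hypothetical expression-tree node; name=None means no event is emitted."""
    def __init__(self, name=None, children=()):
        self.name = name
        self.children = list(children)

def prune(node, wanted):
    """Return a copy of the tree that only emits events named in `wanted`."""
    name = node.name if node.name in wanted else None
    # Recurse into every child. Forgetting this step for one node type
    # leaves its whole subtree unpruned: slower, but the output is the same
    # because the handler ignores the extra events.
    return Node(name, [prune(child, wanted) for child in node.children])

def event_names(node):
    """List the events the tree would emit, in document order."""
    names = [node.name] if node.name is not None else []
    for child in node.children:
        names.extend(event_names(child))
    return names

tree = Node("record", [Node("ID", [Node("entry_name"), Node("junk")]),
                       Node("filler", [Node("keyword")])])
pruned = prune(tree, {"record", "entry_name", "keyword"})
```

Here `event_names(tree)` lists six events while `event_names(pruned)` lists only the three that were asked for.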
Andrew
dalke at acm.org
-------------- next part --------------
"""SwissProtBuilder - create a biopython Bio.SwissProt.Record
This is a first attempt at a Martel interface to create SwissProt
records. It is incomplete because Martel's SWISS-PROT format
definition is a bit lacking, although not enough to affect timings.

I have a test data set which is the first 200024 lines of sprot38. It
takes this code 59.9 seconds to parse the file while the existing
biopython code takes 65.1 seconds, so about 8% faster. There is still
some performance I can eke out of this.

All of sprot38 takes around 25 minutes to parse. The mxTextTools
analysis takes about 10 minutes so the rest is spent in callbacks and
creation code.
"""
import string
from Bio.SwissProt import KeyWList, SProt
from xml.sax import saxlib

# These are elements whose text I want to get
capture_names = ("entry_name", "data_class_table",
                 "molecule_type", "sequence_length", "ac_number",
                 "day", "month", "year", "description", "gene_names",
                 "organism_species", "organelle",
                 "organism_classification", "reference_number",
                 "reference_position", "reference_comment",
                 "bibliographic_database_name",
                 "bibliographic_identifier", "reference_author",
                 "reference_title", "reference_location",
                 "comment_text", "database_identifier",
                 "primary_identifier", "secondary_identifier",
                 "status_identifier", "keyword", "ft_name", "ft_from",
                 "ft_to", "ft_description", "molecular_weight",
                 "crc32", "sequence", )

# These are all of the element events I'm interested in
select_names = capture_names + \
               ("swissprot38_record", "DT_created", "DT_seq_update",
                "DT_ann_update", "reference", "feature", "ID",
                "DR", "comment")
class SwissProtBuilder(saxlib.DocumentHandler):
    def __init__(self):
        self.records = []
        self.capture = 0

    def startElement(self, name, attrs):
        # Arranged in order of most used to least
        if name in capture_names:
            self.capture = 1
            self.text = ""
        elif name == "reference":
            self.reference = SProt.Reference()
        elif name == "feature":
            self.ft_desc = ""
        elif name == "comment":
            self.comment = ""
        elif name == "swissprot38_record":
            self.record = SProt.Record()
        elif name == "DT_created":
            self.in_date = "created"
            self.date = []
        elif name == "DT_seq_update":
            self.in_date = "sequence_update"
            self.date = []
        elif name == "DT_ann_update":
            self.in_date = "annotation_update"
            self.date = []

    def characters(self, ch, start, length):
        if self.capture:
            self.text = self.text + ch[start:start+length]

    def endElement(self, name):
        # Doing the dispatch like this instead of a chain of if/elif
        # statements saved me about 15% because the lookup time goes
        # from O(N) to O(1)
        f = getattr(self, "end_" + name, None)
        if f is not None:
            f()
        if self.capture:
            del self.text
            self.capture = 0

    def end_swissprot38_record(self):
        self.record.sequence = string.replace(self.record.sequence,
                                              " ", "")
        # Delete for now since I'm just doing timings
        #self.records.append(self.record)
        #print self.record
        del self.record

    def end_entry_name(self):
        self.record.entry_name = self.text

    def end_data_class_table(self):
        self.record.data_class = self.text

    def end_molecule_type(self):
        self.record.molecule_type = self.text

    def end_sequence_length(self):
        # Used in both the ID and the SQ lines
        self.seq_len = int(self.text)

    def end_ID(self):
        self.record.sequence_length = self.seq_len

    def end_ac_number(self):
        self.record.accessions.append(self.text)

    def end_day(self):
        self.date.append(self.text)

    def end_month(self):
        self.date.append(self.text)

    def end_year(self):
        self.date.append(self.text)
        setattr(self.record, self.in_date, "%s-%s-%s" % tuple(self.date))

    def end_description(self):
        if self.record.description == "":
            self.record.description = self.text
        else:
            self.record.description = self.record.description + self.text

    def end_gene_names(self):
        # XXX parser isn't correct
        self.record.gene_name = self.text

    def end_organism_species(self):
        # XXX parser isn't correct
        self.record.organism = self.text

    def end_organelle(self):
        # XXX parser isn't correct
        self.record.organelle = self.text

    def end_organism_classification(self):
        # XXX parser isn't correct
        self.record.organism_classification.extend(\
            string.split(self.text[:-1], "; "))

    def end_reference(self):
        self.record.references.append(self.reference)
        del self.reference

    def end_reference_number(self):
        self.reference.number = int(self.text)

    def end_reference_position(self):
        # XXX Why is this a list?
        self.reference.positions.append(self.text)

    def end_reference_comment(self):
        # XXX needs to be list of (token, text)
        self.reference.comments.append(self.text)

    def end_bibliographic_database_name(self):
        self.bib_db_name = self.text

    def end_bibliographic_identifier(self):
        self.reference.references.append( (self.bib_db_name, self.text) )

    def end_reference_author(self):
        if self.reference.authors:
            self.reference.authors = self.reference.authors + " " + self.text
        else:
            self.reference.authors = self.text

    def end_reference_title(self):
        if self.reference.title:
            self.reference.title = self.reference.title + " " + self.text
        else:
            self.reference.title = self.text

    def end_reference_location(self):
        if self.reference.location:
            self.reference.location = self.reference.location + " " + self.text
        else:
            self.reference.location = self.text

    def end_comment_text(self):
        if self.comment:
            self.comment = self.comment + " " + self.text
        else:
            self.comment = self.text

    def end_comment(self):
        self.record.comments.append(self.comment)

    def end_database_identifier(self):
        self.db_id = self.text

    def end_primary_identifier(self):
        self.ids = [self.text]

    def end_secondary_identifier(self):
        self.ids.append(self.text)

    def end_status_identifier(self):
        self.ids.append(self.text)

    def end_DR(self):
        self.record.cross_references.append( (self.db_id,) + tuple(self.ids))

    def end_keyword(self):
        # XXX parser isn't correct
        kw = string.split(self.text[:-1], "; ")
        self.record.keywords.extend(kw)

    def end_feature(self):
        self.record.features.append( (self.ft_name, self.ft_from,
                                      self.ft_to, self.ft_desc) )

    def end_ft_name(self):
        self.ft_name = string.rstrip(self.text)

    def end_ft_from(self):
        self.ft_from = string.lstrip(self.text)  # Jeff first tries int ...

    def end_ft_to(self):
        self.ft_to = string.lstrip(self.text)  # Jeff first tries int ...

    def end_ft_description(self):
        if self.ft_desc:
            self.ft_desc = self.ft_desc + " " + self.text
        else:
            self.ft_desc = self.text

    def end_molecular_weight(self):
        self.mw = int(self.text)

    def end_crc32(self):
        self.record.seqinfo = (self.seq_len, self.mw, self.text)

    def end_sequence(self):
        # Strip out spaces in end_swissprot38_record
        self.record.sequence = self.record.sequence + self.text

def test():
    from Martel.formats import swissprot38
    from xml.sax import saxutils
    import Martel
    import time

    t1 = time.time()
    # Send only the events which the callback will use
    # (saves another 32% of performance, after doing the if/elif speedup)
    format = Martel.select_names(swissprot38.format, select_names)
    parser = format.make_parser()
    dh = SwissProtBuilder()
    parser.setDocumentHandler(dh)
    eh = saxutils.ErrorRaiser()
    parser.setErrorHandler(eh)
    #infile = open("/home/dalke/src/Martel/examples/sample.swissprot")
    #infile = open("/home/dalke/ftps/swissprot/sprot38.dat")
    infile = open("/home/dalke/ftps/swissprot/smaller_sprot38.dat")
    t2 = time.time()
    parser.parseFile(infile)
    t3 = time.time()
    print "startup", t2-t1
    print "eval", t3-t2

if __name__ == "__main__":
    test()
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Martel-0.3.patch
Type: application/octet-stream
Size: 1053 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001012/a846b25c/Martel-0.3.obj