[Biopython-dev] Martel timings

Thu Oct 12 03:55:37 EDT 2000

I'm starting to compare Martel parsing with the existing biopython code.
I wrote a Martel document handler called SwissProtBuilder.py (attached)
which creates Bio.SwissProt.Record objects.

The output is comparable to the existing code, although the new code
still gets a few fields wrong.  (There are a couple of minor things I
need to fix in the grammar to make the parsing easy.)

The timings are also comparable.  The biopython code is about 8% slower
than the Martel code.  The Martel code takes about 25 minutes to parse
sprot38.
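
For reference, here is the arithmetic behind the "about 8%" figure, using
the two timings quoted in the attached SwissProtBuilder.py docstring (59.9
and 65.1 seconds on the 200024-line subset).  The variable names are only
for illustration.  Measured as time saved by Martel it comes to roughly
8.0%; measured as extra time taken by biopython it is roughly 8.7%.

```python
# Timings quoted in the SwissProtBuilder.py docstring (200024-line subset).
martel_seconds = 59.9
biopython_seconds = 65.1

difference = biopython_seconds - martel_seconds

# Extra time the biopython run takes, relative to the Martel run.
biopython_slower = difference / martel_seconds

# Time the Martel run saves, relative to the biopython run.
martel_faster = difference / biopython_seconds

print("biopython takes %.1f%% longer" % (biopython_slower * 100))
print("Martel takes %.1f%% less time" % (martel_faster * 100))
```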

Because of the new RecordReader, it only needed about 4MB of memory.
I assume the biopython code is at least that good.

One of the reasons for the good performance on the Martel side is that
I'm pruning the expression tree to get rid of events which aren't handled
by the callback object.  That eliminates a lot of function call overhead.
I also turned a long if/elif chain in endElement into a dispatch table,
which saved more time because of the conversion from an O(N) lookup to
O(1).
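
A minimal sketch of the if/elif-to-dispatch change, assuming a handler with
only three element names.  ChainHandler and DispatchHandler are made-up
classes for illustration, not part of Martel or biopython; the getattr
lookup is the same idea used in the attached endElement.

```python
class ChainHandler:
    """Hypothetical handler that dispatches with an if/elif chain."""

    def __init__(self):
        self.calls = []

    def endElement(self, name):
        # Each name is compared in turn, so names late in the chain
        # pay for every comparison before them.
        if name == "entry_name":
            self.end_entry_name()
        elif name == "ac_number":
            self.end_ac_number()
        elif name == "keyword":
            self.end_keyword()

    def end_entry_name(self):
        self.calls.append("entry_name")

    def end_ac_number(self):
        self.calls.append("ac_number")

    def end_keyword(self):
        self.calls.append("keyword")


class DispatchHandler(ChainHandler):
    """Same callbacks, but found by name instead of by comparison."""

    def endElement(self, name):
        # One attribute lookup finds the method, however many
        # element names the handler knows about.
        method = getattr(self, "end_" + name, None)
        if method is not None:
            method()


names = ["entry_name", "keyword", "unhandled", "ac_number"]
chain = ChainHandler()
dispatch = DispatchHandler()
for name in names:
    chain.endElement(name)
    dispatch.endElement(name)
```

Both handlers end up making the same calls; only the cost of finding the
right method differs.  Unhandled names are silently skipped in both.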

It turns out there is a bug in the pruning code because the RecordReader
doesn't prune its children.  It doesn't cause an error but just a slowdown,
so I didn't notice it until now.  I've included a patch with this email
which brings Martel-0.3 up to my internal development version.
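
To show the kind of bug this is, here is a rough sketch of pruning on a toy
expression tree.  Node, prune and event_names are hypothetical and do not
reflect Martel's real classes or the contents of the patch; the only point
is that every node type has to pass the pruning on to its children.

```python
class Node:
    """Hypothetical expression-tree node; not one of Martel's classes."""

    def __init__(self, name=None, children=()):
        self.name = name              # None means "emits no event"
        self.children = list(children)


def prune(node, wanted):
    """Return a copy of the tree that keeps only event names in `wanted`."""
    name = node.name if node.name in wanted else None
    # Recursing here is the important part.  A node type that skips this
    # step leaves unwanted events in its whole subtree, which still parses
    # correctly but wastes time on callbacks nobody handles.
    return Node(name, [prune(child, wanted) for child in node.children])


def event_names(node):
    """List the event names the tree would generate, in document order."""
    names = [node.name] if node.name is not None else []
    for child in node.children:
        names.extend(event_names(child))
    return names


tree = Node("record", [
    Node("ID", [Node("entry_name"), Node("junk")]),
    Node("noise", [Node("keyword")]),
])
pruned = prune(tree, {"record", "entry_name", "keyword"})
print(event_names(pruned))
```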

                    Andrew
                    dalke at acm.org

-------------- next part --------------
"""SwissProtBuilder - create a biopython Bio.SwissProt.Record

This is a first attempt at a Martel interface to create SwissProt
records.  It is incomplete because Martel's SWISS-PROT format
definition is a bit lacking, although not enough to affect timings.

I have a test data set which is the first 200024 lines of sprot38.  It
takes this code 59.9 seconds to parse the file while the existing
biopython code takes 65.1 seconds, so about 8% faster.  There is still
some performance I can eke out of this.

All of sprot38 takes around 25 minutes to parse.  The mxTextTools
analysis takes about 10 minutes so the rest is spent in callbacks and
creation code.

"""

import string
from Bio.SwissProt import KeyWList, SProt
from xml.sax import saxlib

# These are elements whose text I want to get
capture_names = ("entry_name", "data_class_table",
                 "molecule_type", "sequence_length", "ac_number",
                 "day", "month", "year", "description", "gene_names",
                 "organism_species", "organelle",
                 "organism_classification", "reference_number",
                 "reference_position", "reference_comment",
                 "bibliographic_database_name",
                 "bibliographic_identifier", "reference_author",
                 "reference_title", "reference_location",
                 "comment_text", "database_identifier",
                 "primary_identifier", "secondary_identifier",
                 "status_identifier", "keyword", "ft_name", "ft_from",
                 "ft_to", "ft_description", "molecular_weight",
                 "crc32", "sequence", )

# These are all of the element events I'm interested in
select_names = capture_names + \
               ("swissprot38_record", "DT_created", "DT_seq_update",
               "DT_ann_update", "reference", "feature", "ID",
               "DR", "comment")

class SwissProtBuilder(saxlib.DocumentHandler):
    def __init__(self):
        self.records = []
        self.capture = 0

    def startElement(self, name, attrs):
        # Arranged in order of most used to least
        if name in capture_names:
            self.capture = 1
            self.text = ""
        elif name == "reference":
            self.reference = SProt.Reference()
        elif name == "feature":
            self.ft_desc = ""
        elif name == "comment":
            self.comment = ""
        elif name == "swissprot38_record":
            self.record = SProt.Record()
        elif name == "DT_created":
            self.in_date = "created"
            self.date = []
        elif name == "DT_seq_update":
            self.in_date = "sequence_update"
            self.date = []
        elif name == "DT_ann_update":
            self.in_date = "annotation_update"
            self.date = []

    def characters(self, ch, start, length):
        if self.capture:
            self.text = self.text + ch[start:start+length]

    def endElement(self, name):
        # Doing the dispatch like this instead of a chain of if/elif
        # statements saved me about 15% because the lookup time goes
        # from O(N) to O(1)
        f = getattr(self, "end_" + name, None)
        if f is not None:
            f()

        if self.capture:
            del self.text
            self.capture = 0

    def end_swissprot38_record(self):
        self.record.sequence = string.replace(self.record.sequence,
                                              " ", "")
        # Delete for now since I'm just doing timings
        #self.records.append(self.record)
        #print self.record
        del self.record

    def end_entry_name(self):
        self.record.entry_name = self.text
    def end_data_class_table(self):
        self.record.data_class = self.text
    def end_molecule_type(self):
        self.record.molecule_type = self.text
    def end_sequence_length(self):
        # Used in both the ID and the SQ lines
        self.seq_len = int(self.text)
    def end_ID(self):
        self.record.sequence_length = self.seq_len
    def end_ac_number(self):
        self.record.accessions.append(self.text)
    def end_day(self):
        self.date.append(self.text)
    def end_month(self):
        self.date.append(self.text)
    def end_year(self):
        self.date.append(self.text)
        setattr(self.record, self.in_date, "%s-%s-%s" % tuple(self.date))

    def end_description(self):
        if self.record.description == "":
            self.record.description = self.text
        else:
            self.record.description = self.record.description + self.text
    def end_gene_names(self):
        # XXX parser isn't correct
        self.record.gene_name = self.text
    def end_organism_species(self):
        # XXX parser isn't correct
        self.record.organism = self.text
    def end_organelle(self):
        # XXX parser isn't correct
        self.record.organelle = self.text
    def end_organism_classification(self):
        # XXX parser isn't correct
        self.record.organism_classification.extend(\
                string.split(self.text[:-1], "; "))

    def end_reference(self):
        self.record.references.append(self.reference)
        del self.reference
    def end_reference_number(self):
        self.reference.number = int(self.text)
    def end_reference_position(self):
        # XXX Why is this a list?
        self.reference.positions.append(self.text)
    def end_reference_comment(self):
        # XXX needs to be list of (token, text)
        self.reference.comments.append(self.text)
    def end_bibliographic_database_name(self):
        self.bib_db_name = self.text
    def end_bibliographic_identifier(self):
        self.reference.references.append( (self.bib_db_name, self.text) )
    def end_reference_author(self):
        if self.reference.authors:
            self.reference.authors = self.reference.authors + " " + self.text
        else:
            self.reference.authors = self.text
    def end_reference_title(self):
        if self.reference.title:
            self.reference.title = self.reference.title + " " + self.text
        else:
            self.reference.title = self.text
    def end_reference_location(self):
        if self.reference.location:
            self.reference.location = self.reference.location + " " + self.text
        else:
            self.reference.location = self.text
    def end_comment_text(self):
        if self.comment:
            self.comment = self.comment + " " + self.text
        else:
            self.comment = self.text
    def end_comment(self):
        self.record.comments.append(self.comment)
    def end_database_identifier(self):
        self.db_id = self.text
    def end_primary_identifier(self):
        self.ids = [self.text]
    def end_secondary_identifier(self):
        self.ids.append(self.text)
    def end_status_identifier(self):
        self.ids.append(self.text)
    def end_DR(self):
        self.record.cross_references.append( (self.db_id,) + tuple(self.ids))
    def end_keyword(self):
        # XXX parser isn't correct
        kw = string.split(self.text[:-1], "; ")
        self.record.keywords.extend(kw)
    def end_feature(self):
        self.record.features.append( (self.ft_name, self.ft_from,
                                      self.ft_to, self.ft_desc) )
    def end_ft_name(self):
        self.ft_name = string.rstrip(self.text)
    def end_ft_from(self):
        self.ft_from = string.lstrip(self.text)  # Jeff first tries int ...
    def end_ft_to(self):
        self.ft_to = string.lstrip(self.text)  # Jeff first tries int ...
    def end_ft_description(self):
        if self.ft_desc:
            self.ft_desc = self.ft_desc + " " + self.text
        else:
            self.ft_desc = self.text
    def end_molecular_weight(self):
        self.mw = int(self.text)
    def end_crc32(self):
        self.record.seqinfo = (self.seq_len, self.mw, self.text)
    def end_sequence(self):
        # Strip out spaces in end_swissprot38_record
        self.record.sequence = self.record.sequence + self.text

def test():
    from Martel.formats import swissprot38
    from xml.sax import saxutils
    import Martel
    import time
    t1 = time.time()

    # Send only the events which the callback will use
    # (saves another 32% of performance, after doing the if/elif speedup)
    format = Martel.select_names(swissprot38.format, select_names)

    parser = format.make_parser()
    dh = SwissProtBuilder()
    parser.setDocumentHandler(dh)
    eh = saxutils.ErrorRaiser()
    parser.setErrorHandler(eh)

    #infile = open("/home/dalke/src/Martel/examples/sample.swissprot")
    #infile = open("/home/dalke/ftps/swissprot/sprot38.dat")
    infile = open("/home/dalke/ftps/swissprot/smaller_sprot38.dat")

    t2 = time.time()
    parser.parseFile(infile)
    t3 = time.time()
    print "startup", t2-t1
    print "eval", t3-t2

if __name__ == "__main__":
    test()

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Martel-0.3.patch
Type: application/octet-stream
Size: 1053 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001012/a846b25c/Martel-0.3.obj