[Biopython-dev] swissprot parsing performance comparisons

Andrew Dalke dalke at acm.org
Wed Jan 10 04:12:40 EST 2001

My paper about Martel for the Python conference was
accepted.  One of the reviewers wanted more comparisons
with existing projects so I've been doing that.  In
case you are interested, here are the timings for parsing
SWISS-PROT release 38:

Time (m:ss)  Method                     Description
-----------  ------                     -----------
    0:56.63  grep.time                  grep ^ID
    1:27.89  rec_reader.time            Martel's RecordReader.StartsWith "ID"
    1:47.71  swissknife.time            lazy Swissknife (doesn't parse the fields)
    1:59.78  swissknife.time            ditto, to check reproducibility
    6:41.14  swissprot38_no_tags.time   Martel but without tag elements
    8:47.09  swissknife_id_sq.time      Swissknife, extracting ID and SQ
    9:22.43  swissprot38_id_sq.time     Martel with entry_name & sequence tags
   23:28.59  SwissProtBuilder.time      Martel building Biopython's SP records
   28:54.69  biopython.time             Biopython building its own SP records
   30:12.85  bioperl.time               Bioperl building its own SP records
   38:20.65  swissknife_full.time       Swissknife with full parsing enabled
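
To compare rows as ratios rather than raw times, a small Python sketch
like the following may help.  It only converts a few of the m:ss values
from the table into seconds and divides them.  The dictionary keys and
function names are made up for this illustration, and treating the
no-tags run as the "parse only" cost is an assumption of the sketch,
not something measured separately.

```python
# A few timings copied from the table (m:ss.ss).  The keys are shorthand
# labels chosen for this sketch only.
TIMINGS = {
    "grep": "0:56.63",
    "rec_reader": "1:27.89",
    "swissprot38_no_tags": "6:41.14",
    "SwissProtBuilder": "23:28.59",
}


def to_seconds(stamp):
    """Convert an 'm:ss.ss' string into seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def ratio(method, baseline="grep"):
    """Return how many times longer `method` took than `baseline`."""
    return to_seconds(TIMINGS[method]) / to_seconds(TIMINGS[baseline])


def share_outside_parsing(full="SwissProtBuilder", parse_only="swissprot38_no_tags"):
    """Rough fraction of the full run not accounted for by the no-tags run.

    Assumption: the no-tags run approximates the pure parsing cost.
    """
    return 1.0 - to_seconds(TIMINGS[parse_only]) / to_seconds(TIMINGS[full])


if __name__ == "__main__":
    for name in TIMINGS:
        print(f"{name:22s} {to_seconds(TIMINGS[name]):8.2f} s  {ratio(name):5.2f}x grep")
    print(f"share outside parsing: {share_outside_parsing():.2f}")
```

Everything is kept relative to a named baseline so the same helper works
for comparing any two rows.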

Some notes:
  - I like that the RecordReader is only about 55% slower than grep!
  - Swissknife is at ftp://ftp.ebi.ac.uk/pub/software/swissprot/
  - Swissknife has a performance problem when reading SQ
     records.  I commented out part of the offending code and sent
     email to the authors about it.
  - extracting only the ID & SQ records emulates the minimum parsing
     needed for FASTA generation
  - Martel's SwissProtBuilder imports the old xml libraries and
      needed to be fixed before use.
  - swissprot38_id_sq is the same program as SwissProtBuilder but
      with all of the tags removed except for "entry_name",
      "sequence" and "swissprot38_record".  (The last is present
      as a sanity counter so I know the parse is progressing.)
      Some extra performance could be gained by writing a
      document handler which is more specific to the task.
  - the "SP records" are the existing biopython SwissProtRecord
  - fully 3/4 of SwissProtBuilder's run time is spent in function
      callbacks for tags and in object creation, not directly in
      parsing.
  - the current CVS version of biopython's swissprot parser will
      not parse release 38 because it says the OX record is
      required.  Changing its "one_or_more" value to 0 fixed things.
  - bioperl and biopython likely capture somewhat different
      data so they cannot be directly compared.
  - swissknife is perhaps the least stringent parser, followed by
      bioperl.  Biopython and Martel are much pickier.
      It is hard to judge whether this reflects the natural
      inclination of libraries in the two languages, because the
      two perl packages come from the same programming "culture"
      (EBI/Sanger), as do the two python packages (their authors
      had the same employers at the same time)  :)
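
For readers unfamiliar with the "ID and SQ only" case, here is a
minimal pure-Python sketch of the idea: split the flat file into records
at lines starting with "ID", keep just the entry name and the sequence,
and write FASTA.  This is not Martel's RecordReader or any of the timed
programs.  The function names and the two sample entries (including the
accession number) are invented for illustration, and the sketch assumes
the usual layout where sequence lines follow the SQ line and the record
ends with "//".

```python
import io

# Two made-up entries in SWISS-PROT-like flat-file layout (illustration only).
SAMPLE = """\
ID   TEST1_HUMAN    STANDARD;      PRT;    12 AA.
AC   P00000;
SQ   SEQUENCE   12 AA;  0 MW;  0 CRC32;
     MKTAYIAKQR QI
//
ID   TEST2_HUMAN    STANDARD;      PRT;     5 AA.
SQ   SEQUENCE   5 AA;  0 MW;  0 CRC32;
     ACDEF
//
"""


def read_records(handle):
    """Yield each record as a list of lines; a line starting with 'ID' opens a new one."""
    record = []
    for line in handle:
        if line.startswith("ID") and record:
            yield record
            record = []
        record.append(line)
    if record:
        yield record


def id_and_sequence(record):
    """Return (entry_name, sequence) from one record, ignoring all other fields."""
    entry_name = None
    parts = []
    in_sequence = False
    for line in record:
        if line.startswith("ID"):
            entry_name = line.split()[1]
        elif line.startswith("SQ"):
            in_sequence = True          # sequence data starts on the next line
        elif line.startswith("//"):
            in_sequence = False         # end-of-record marker
        elif in_sequence:
            parts.append("".join(line.split()))  # drop the spacing between blocks
    return entry_name, "".join(parts)


def to_fasta(entry_name, sequence, width=60):
    """Format one entry as FASTA text with lines of at most `width` residues."""
    lines = [">" + entry_name]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for rec in read_records(io.StringIO(SAMPLE)):
        print(to_fasta(*id_and_sequence(rec)), end="")
```

Splitting into records first and parsing fields second mirrors the
two-stage structure of the timings: record splitting is cheap, and the
cost grows with how many fields are actually extracted.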

