[Biopython-dev] swissprot parsing performance comparisons
Andrew Dalke
dalke at acm.org
Wed Jan 10 04:12:40 EST 2001
My paper about Martel for the Python conference was
accepted. One of the reviewers wanted more comparisons
with existing projects so I've been doing that. In
case you are interested, here are the timings for parsing
SWISS-PROT release 38:
Time      Method                    Description
--------  ------------------------  ------------------------------------------
 0:56.63  grep.time                 grep ^ID
 1:27.89  rec_reader.time           Martel's RecordReader.StartsWith "ID"
 1:47.71  swissknife.time           lazy Swissknife (doesn't parse the fields)
 1:59.78  swissknife.time           ditto, to check reproducibility
 6:41.14  swissprot38_no_tags.time  Martel but without tag elements
 8:47.09  swissknife_id_sq.time     Swissknife, extracting ID and SQ
 9:22.43  swissprot38_id_sq.time    Martel with entry_name & sequence tags
23:28.59  SwissProtBuilder.time     Martel building Biopython's SP records
28:54.69  biopython.time            Biopython building its own SP records
30:12.85  bioperl.time              Bioperl building its own SP records
38:20.65  swissknife_full.time      Swissknife with full parsing enabled
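To make the relative costs easier to read, here is a small sketch that converts a few of the timings above to seconds and expresses each as a multiple of the grep baseline. The helper names (to_seconds, slowdown) are made up for this illustration; only the timing strings come from the table.

```python
def to_seconds(stamp):
    """Convert an 'M:SS.ss' timing string into seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


# A subset of the timings, copied from the table in this post.
TIMINGS = {
    "grep.time": "0:56.63",
    "rec_reader.time": "1:27.89",
    "swissprot38_id_sq.time": "9:22.43",
    "SwissProtBuilder.time": "23:28.59",
    "biopython.time": "28:54.69",
    "bioperl.time": "30:12.85",
    "swissknife_full.time": "38:20.65",
}


def slowdown(name, baseline="grep.time"):
    """Return how many times longer `name` took than `baseline`."""
    return to_seconds(TIMINGS[name]) / to_seconds(TIMINGS[baseline])


for name in TIMINGS:
    print(f"{name:26s} {slowdown(name):6.2f}x grep")
```

For example, slowdown("rec_reader.time") comes out at roughly 1.55, i.e. the RecordReader run took about one and a half times as long as grep.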
Some notes:
- I like that the RecordReader is only about 55% slower than grep!
- Swissknife is at ftp://ftp.ebi.ac.uk/pub/software/swissprot/
- Swissknife contains a performance problem when reading SQ
records. I commented out some of the problem code and sent
email to the authors about it.
- extracting only the ID & SQ records emulates the minimum parsing
needed for FASTA generation
- Martel's SwissProtBuilder imports the old xml libraries and
needed to be fixed before use.
- swissprot38_id_sq is the same program as SwissProtBuilder but
with all of the tags removed except for "entry_name",
"sequence" and "swissprot38_record". (The last is present
as a sanity counter so I know the parse is progressing.)
Some extra performance could be gained by making a document
handler which is more specific to the task.
- the "SP records" are the existing biopython SwissProtRecord
- fully 3/4 of SwissProtBuilder's run time is spent in function
callbacks for tags and in object creation, not directly in parsing.
- the current CVS version of biopython's swissprot parser will
not parse release 38 because it says the OX record is
required. Changing its "one_or_more" value to 0 fixed things.
- bioperl and biopython likely capture somewhat different
data so they cannot be directly compared.
- swissknife is perhaps the least stringent parser, followed by
bioperl. Biopython and Martel are much pickier.
It is hard to judge whether this reflects a natural inclination
of libraries in the two languages, because the two perl packages
come from the same programming "culture" (EBI/Sanger), as do the
two python packages (their authors had the same employers at the
same time) :)
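To illustrate what "extracting ID and SQ" for FASTA generation involves, here is a minimal sketch in plain Python. It is not Martel's RecordReader or Swissknife code: the function names are made up, the two sample entries are toy data rather than real SWISS-PROT records, and it assumes only that each record starts with an ID line, that the sequence lines follow the SQ line, and that "//" ends the record.

```python
import io


def read_records(handle, start="ID"):
    """Yield each record as a list of lines; a line beginning with
    `start` opens a new record (hypothetical helper, not Martel's API)."""
    record = []
    for line in handle:
        if line.startswith(start) and record:
            yield record
            record = []
        record.append(line)
    if record:
        yield record


def id_and_sequence(record):
    """Return (entry_name, sequence) from the ID line and SQ block."""
    entry_name = None
    parts = []
    in_sequence = False
    for line in record:
        if line.startswith("ID"):
            entry_name = line.split()[1]
        elif line.startswith("SQ"):
            in_sequence = True
        elif line.startswith("//"):
            in_sequence = False
        elif in_sequence:
            # Sequence lines are space-separated blocks of residues.
            parts.append("".join(line.split()))
    return entry_name, "".join(parts)


def to_fasta(handle, width=60):
    """Build FASTA text from a SWISS-PROT style flat file handle."""
    out = []
    for record in read_records(handle):
        name, seq = id_and_sequence(record)
        out.append(">" + str(name))
        for i in range(0, len(seq), width):
            out.append(seq[i:i + width])
    return "\n".join(out) + "\n"


# Toy input, invented for this example (not real database entries).
SAMPLE = """\
ID   TEST1_HUMAN     STANDARD;      PRT;    12 AA.
AC   X00001;
SQ   SEQUENCE   12 AA;  0 MW;  0 CN;
     MKTAYIAKQR QI
//
ID   TEST2_MOUSE     STANDARD;      PRT;     4 AA.
SQ   SEQUENCE   4 AA;  0 MW;  0 CN;
     GSHM
//
"""

print(to_fasta(io.StringIO(SAMPLE)), end="")
```

Everything other than the ID and SQ lines is skipped without being examined, which is why this kind of extraction sits so much closer to the grep timing than to the full-record builds.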
Andrew