[BioPython] looking for datafile parsers
Andrew Dalke
dalke@acm.org
Thu, 11 Jan 2001 05:32:21 -0700
Hello,
I'm working on a parser generator as part of the
Biopython development. It's getting towards completion
which means it's time to start writing papers about it. :)
Indeed, my paper was accepted for a talk at the upcoming
Python conference. One of the reviewers wanted more
information comparing my work to others in the field, so
I've been digging up related projects. I figure on writing
another paper for Bioinformatics which will include some
more of this information.
The most similar program is SRS, which is also a parser
generator, although its parsers are context free while my parser is
(mostly) regular. I tried to get a copy of the reference
paper (from Meth.Enzy.) from the library but it was checked
out. I would love it if someone would offer to answer a
few questions for me about it, and to run some benchmarks
to see how fast it parses swissprot38, say, as compared to
how long it takes the bioperl code to parse the same file.
Any takers?
There are a few projects which allow users to specify
a format using a configuration description which can roughly
be classified as a regular expression pattern matcher sitting
on top of a line type recognizer. These include Biopy and
BioDB-Loader as well as the current Biopython parser. Another
class of projects uses a common data structure then implements
readers/writers to the different formats at the expense of
throwing away some data, such as bioperl and SeqIO. Swissknife
is an example of a library which reads/writes from a single
format into a data format tailored specifically to that format.
A few are special case programs (grep, NiceProt, sp2fasta)
which do one and only one thing, although in the case of
sw2xml that one thing converts the format (SWISS-PROT) to
another format (XML) for which many tools are readily available.
Most of the packages throw away formatting information and
only store the physical data, although get-sprot-entry is a nice
example of why keeping presentation information is useful.
The program creates an HTML page which looks the same as the
original format except that various fields are marked up with
hyperlinks. Finally, the project I've been working on, Martel,
lets you develop parsers which handle most, if not all, of
these cases.
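To make the first class concrete, here is a minimal sketch of a
"regular expression pattern matcher sitting on top of a line type
recognizer". It is not taken from Biopy, BioDB-Loader or Biopython;
the names LINE_PATTERNS, recognize and parse_lines are made up for
illustration, and the two patterns cover only a simplified form of
the SWISS-PROT ID and AC lines.

```python
import re

# Hypothetical, heavily simplified format description: one regular
# expression per two-letter line code. A real description needs many more.
LINE_PATTERNS = {
    "ID": re.compile(
        r"(?P<entry_name>\S+)\s+(?P<data_class>\w+);\s+"
        r"(?P<molecule>\w+);\s+(?P<length>\d+) AA\."
    ),
    "AC": re.compile(r"(?P<accessions>(?:\w+;\s*)+)"),
}


def recognize(line):
    """Line type recognizer: the code is in the first two columns and
    the data starts at the sixth column."""
    return line[:2], line[5:].rstrip("\n")


def parse_lines(lines):
    """Yield (line_code, fields) for every line with a known pattern."""
    for line in lines:
        code, data = recognize(line)
        pattern = LINE_PATTERNS.get(code)
        if pattern is None:
            continue  # unknown line type: silently skipped in this sketch
        match = pattern.match(data)
        if match is None:
            raise ValueError("malformed %s line: %r" % (code, line))
        yield code, match.groupdict()
```

The recognizer only looks at column positions; all of the
format-specific knowledge lives in the table of patterns, which is
what makes this style easy to drive from a configuration file.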
I want to make sure I covered everything so I've been searching
for SWISS-PROT parsers as my prototypical example. A
description of what I found is below. If something major is
missing, please tell me. If you can provide assistance with
the SRS, GCG, Java or Lisp parts, also please tell me.
Here's a key to some of the notation I use in the listings below:
count == count the number of records in a database
offset == generate offsets into the file for fast indexing
(also written as "index" in the listings below)
fasta == extract data for FASTA (ID, AC and SQ fields)
generic == extract generic sequence data, usually as a
data structure containing fields common to multiple formats
but ignoring some SWISS-PROT specific fields
all == extract all fields
validate == validate that a record is in the correct format
markup == identifies fields and saves the layout data so as
to allow HTML markup without otherwise changing the format
(timings not given for markup since it will depend on the
specific markup requested, and because only Martel and
get-sprot-entry preserve markup)
Performance is measured against the 80,000 records of
swissprot38
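For reference, the "count" and "offset" operations need nothing
more than a scan for lines starting with "ID". The following is a
rough Python sketch, not code from any of the packages listed; the
function names and the toy two-record file are invented for the
example.

```python
import io


def record_offsets(stream):
    """Return the byte offset of each line that starts with b"ID"."""
    offsets = []
    position = 0
    for line in stream:  # the stream must be opened in binary mode
        if line.startswith(b"ID"):
            offsets.append(position)
        position += len(line)
    return offsets


def read_record(stream, offset):
    """Seek to a stored offset and return the record up to its "//" line."""
    stream.seek(offset)
    lines = []
    for line in stream:
        lines.append(line)
        if line.startswith(b"//"):
            break
    return b"".join(lines)


# Toy two-record file, invented for this example.
toy = io.BytesIO(b"ID   A_TEST\nAC   P1;\n//\nID   B_TEST\n//\n")
offsets = record_offsets(toy)
print(len(offsets), "records")
print(read_record(toy, offsets[1]))
```

The record count is just the length of the offset list, so "count"
comes for free once "offset" is available.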
grep - http://www.gnu.org/gnulist/production/grep.html
written in C
count (when used as "grep ^ID | wc")
takes 0m:57s to parse sprot38
offset (when used as "grep -b ^ID")
cannot be used for fasta, generic, all, validate, markup
one really large regular expression (here as a bit of humor)
written in C
cannot be used for count, offset, fasta, generic, all, markup
can be used for validate in theory, but I haven't tested it
bioperl - http://www.bioperl.org/
written in Perl
count (as a special case of generic)
fasta (as a special case of generic)
generic
takes 30m:13s to parse sprot38
cannot be used for index (?), all, validate, markup
biopython - http://www.biopython.org/
written in Python
count (as a special case of all)
fasta (as a special case of all)
generic (as a special case of all)
all
takes 28m:55s to parse sprot38
validate
cannot be used for index(?), markup
biojava - http://www.biojava.org/
written in Java
unknown (have source but need to figure it out)
performance unknown (don't know how to code in Java)
Martel - http://www.biopython.org/~dalke/Martel/
written in Python with a C extension
count
RecordReader.StartsWith "ID" takes 1m:28s to parse sprot38
index
fasta (standard format def. but only using the ID and SQ tags)
takes 9m:23s to parse sprot38
generic (as a special case of all)
all
takes 23m:29s to parse sprot38
validate
with no callbacks takes 6m:41s
markup
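The idea behind a "starts with" record reader can be sketched in a
few lines. This is only an illustration of the approach and is not
Martel's RecordReader implementation or API; the function name
records_starting_with is made up.

```python
def records_starting_with(lines, marker):
    """Group an iterable of lines into records.

    Each record begins at a line starting with `marker`. Any lines
    before the first marker are dropped in this sketch.
    """
    current = None
    for line in lines:
        if line.startswith(marker):
            if current is not None:
                yield "".join(current)
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        yield "".join(current)


def count_records(lines, marker="ID"):
    """Count records without parsing their contents."""
    return sum(1 for _ in records_starting_with(lines, marker))
```

Because no field is parsed, counting this way costs little more than
reading the file, and each yielded record can be handed to a full
parser later.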
SRS - http://www.lionbio.co.uk/
written in C (?)
have never used it, but it can definitely do count, fasta,
generic and all. The standard swissprot format definition
http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-page+LibInfo+-id+01FXMii+-lib+SWISSPROT
cannot be used to validate although SRS itself can. I
think SRS can be used to generate HTML markup but I can't
begin to guess how that might be done.
*** I really want to ask someone questions about SRS ***
*** Any takers? ***
I don't think it can be used to create your own indices -
you must use its offset tables.
swissknife - ftp://ftp.ebi.ac.uk/pub/software/swissprot/
written in Perl
count
lazy reader takes 1m:48s to parse sprot38
fasta (getting the ->ID and ->SQ attributes)
takes 8m:47s to parse sprot38
generic (as a special case of all)
all
takes 38m:21s to parse sprot38
cannot be used to validate, markup
Biopy - http://shag.embl-heidelberg.de:8000/Biopy/
written in Python
count (as a special case of all)
index (by "position += length($_)")
fasta (as a special case of all)
generic (as a special case of all)
all - requires additional programming to parse the subfields
(it only identifies lines) so I actually wouldn't count
this as a full parser.
* takes roughly 25m to parse
cannot be used to validate, markup
Darwin - http://cbrg.inf.ethz.ch/Darwinshome.html
is its own language and set of libraries
contains a converter from SWISS-PROT to its own format.
I don't have access to the source code so the following is based
on the example parser at
http://www.inf.ethz.ch/personal/hallett/drive/node92.html
count (as a special case of all)
fasta (as a special case of all)
generic (as a special case of all)
all - requires additional programming to parse the subfields
although the real implementation may contain all of that.
given example cannot be used to index, validate, markup
(Why does http://www.inf.ethz.ch/personal/hallett/drive/drive.html
say that SWISS-PROT 38 has only 77,977 records when my copy has
exactly 80,000?)
SeqIO - http://www.cs.ucdavis.edu/~gusfield/seqio.tar.gz
written in C
count (as a special case of generic)
fasta (as a special case of generic)
generic
have not yet benchmarked
cannot be used to index, all, validate, markup
readseq (C) - http://iubio.bio.indiana.edu/soft/molbio/readseq/version1/readseq.shar
written in C
doesn't have swissprot, so I need to test if embl works instead
to be tested
readseq (Java) - http://iubio.bio.indiana.edu/soft/molbio/readseq/java/readseq-source.zip
written in Java
have not yet explored (see above where I need help on how
to write a good test program in Java.)
Boulder - http://stein.cshl.org/software/boulder/
written in Perl
count (as a special case of generic)
fasta (as a special case of generic)
generic
have not yet benchmarked
cannot be used for index, all, validate, markup
molbio++ - ftp://ftp.ebi.ac.uk/pub/software/unix/molbio.tar.Z
written in (now obsolete) C++ which doesn't compile
I think it can be classified as
count (as a special case of generic)
fasta (as a special case of generic)
generic, although it calls for some extra parsing to get
at subfields of a data line
* will not be benchmarking since I don't want to spend
the effort to get it to compile.
cannot be used for index, all, validate, markup
BioDB-Loader - http://www.franz.com/services/conferences_seminars/ismb2000/biodb1.tar.Z
written in Common Lisp (Help! I know even less Lisp than Java!)
I'm guessing it can be classified as
count (as a special case of generic)
index
fasta (as a special case of generic)
generic, although it calls for some extra parsing to get
at the subfields of a data line
* have not benchmarked, although I have downloaded the Allegro
Common Lisp demo version.
cannot be used for all, validate, markup
GCG - http://www.gcg.com/products/wis-package.html
written in C (?)
never used it. Betting it can be classified as
count (as a special case of generic)
index
fasta (as a special case of generic)
generic
have not benchmarked since I'm not spending that much
money just to test the performance.
cannot be used for all, validate, markup
sp2fasta - part of ftp://ftp.ncbi.nlm.nih.gov/toolkit/ ?
Can't seem to find it in the current distribution. Various
web pages imply it is a C program to convert SWISS-PROT/EMBL
to FASTA.
count (if used together with grep and wc)
fasta
have not benchmarked since I cannot find code
cannot be used for index, generic, all, validate, markup
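Since I can't find the sp2fasta code, here is roughly what the
"fasta" operation amounts to, using only the ID, AC and SQ fields.
This is my own sketch and not sp2fasta; the function name and the
header layout (">primary-accession entry-name") are arbitrary
choices made for the example.

```python
def swissprot_to_fasta(lines, width=60):
    """Yield one FASTA-formatted string per SWISS-PROT record.

    Only the ID, AC and SQ fields are used; everything else is
    ignored. A trailing record without a "//" line is not emitted.
    """
    entry_name = ""
    accessions = []
    residues = []
    in_sequence = False
    for line in lines:
        code = line[:2]
        if code == "ID":
            entry_name = line[5:].split()[0]
        elif code == "AC":
            accessions.extend(
                acc.strip() for acc in line[5:].split(";") if acc.strip()
            )
        elif code == "SQ":
            in_sequence = True  # sequence data follows on blank-code lines
        elif code == "//":
            sequence = "".join(residues)
            primary = accessions[0] if accessions else ""
            chunks = [
                sequence[i:i + width] for i in range(0, len(sequence), width)
            ]
            yield ">%s %s\n%s\n" % (primary, entry_name, "\n".join(chunks))
            entry_name, accessions, residues = "", [], []
            in_sequence = False
        elif in_sequence and code == "  ":
            residues.append("".join(line.split()))  # drop column spacing
```

Combined with a line count of the ">" headers this also gives
"count", which is the same trick as using sp2fasta with grep and wc.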
sw2xml - http://www.vsms.nottingham.ac.uk/biodom/software/protsuite-user-dist/sw2xml-protbot.pl
written in Perl. It is a translation program from SWISS-PROT
to XML so some additional, though minor, XML coding is needed
to do the following.
count (as a special case of all)
fasta (as a special case of all)
generic (as a special case of all)
all
have not yet benchmarked
cannot be used to index, validate, markup (because of the 'tidy')
NiceProt - used at ExPASy
implementation information not available
only used to parse a single record
parses the data file but doesn't build a data structure (?)
so creation of fasta, generic and all requires some modifications.
cannot be used to count, index, validate(?), markup
get-sprot-entry - used at ExPASy
implementation not available
can be used to markup a record (eg, see
http://expasy.cbr.nrc.ca/cgi-bin/get-sprot-entry?P52930 )
doesn't build data structures or convert to another format
so it cannot be used for anything else (true?)
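To show what "markup" means in practice, here is a small sketch
that turns accession numbers on AC lines into hyperlinks and leaves
every other character where it was. It is a guess at the general
approach, not the get-sprot-entry implementation (which I haven't
seen); the function names and the example.org URL template are
placeholders.

```python
import html
import re

# An accession number, assumed here to be a run of word characters
# immediately followed by a semicolon.
ACCESSION = re.compile(r"\w+(?=;)")


def markup_line(line, url_template):
    """Return the line as HTML, linking accessions on AC lines.

    All other text is only HTML-escaped, so the layout is preserved.
    """
    if not line.startswith("AC   "):
        return html.escape(line, quote=False)
    parts = []
    last = 0
    for match in ACCESSION.finditer(line, 5):  # skip the line code
        parts.append(html.escape(line[last:match.start()], quote=False))
        accession = match.group(0)
        parts.append(
            '<a href="%s">%s</a>' % (url_template % accession, accession)
        )
        last = match.end()
    parts.append(html.escape(line[last:], quote=False))
    return "".join(parts)


def markup_record(lines, url_template="https://example.org/entry?%s"):
    """Wrap a marked-up record in <pre> so the column layout survives."""
    body = "".join(markup_line(line, url_template) for line in lines)
    return "<pre>" + body + "</pre>"
```

Stripping the tags from the output gives back the (escaped)
original text, which is the property a parser has to keep if it is
going to support this kind of use.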
Whew! I'd be surprised if I really did miss some other
major style of parsing. Actually, I did - there are no
lex/yacc grammars for SWISS-PROT but I'm not surprised
because the lexing is strongly position dependent which
calls for tight, explicit, tricky communications with the
parser.
Any other suggestions?
Sincerely,
Andrew Dalke
dalke@acm.org