[Biojava-l] looking for datafile parsers

Andrew Dalke dalke@acm.org
Thu, 11 Jan 2001 05:32:21 -0700


Hello,

  I'm working on a parser generator as part of the
Biopython development.  It's getting towards completion
which means it's time to start writing papers about it.  :)
Indeed, my paper was accepted for a talk at the upcoming
Python conference.  One of the reviewers wanted more
information comparing my work to others in the field, so
I've been digging up related projects.  I figure on writing
another paper for Bioinformatics which will include some
more of this information.

  The most similar program is SRS, which is also a parser
generator, although its parsers are context free while mine
are (mostly) regular.  I tried to get a copy of the reference
paper (from Meth.Enzy.) from the library but it was checked
out.  I would love it if someone would offer to answer a
few questions for me about it, and to run some benchmarks
to see how fast it parses swissprot38, say, as compared to
how long it takes the bioperl code to parse the same file.
Any takers?

  There are a few projects which allow users to specify
a format using a configuration description which can roughly
be classified as a regular expression pattern matcher sitting
on top of a line type recognizer.  This includes Biopy and
BioDB-Loader as well as the current Biopython parser.  Another
class of projects, such as bioperl and SeqIO, uses a common data
structure and then implements readers/writers for the different
formats at the expense of throwing away some data.  Swissknife
is an example of a library which reads/writes a single
format using a data structure tailored specifically to that format.
A few are special case programs (grep, NiceProt, sp2fasta)
which do one and only one thing, although in the case of
sw2xml that one thing converts the format (SWISS-PROT) to
another format (XML) for which many tools are readily available.
Most of the packages throw away formatting information and
only store the physical data, although get-sprot-entry is a nice
example of why keeping presentation information is useful.
The program creates an HTML page which looks the same as the
original format except that various fields are marked up with
hyperlinks.  Finally, the project I've been working on, Martel,
lets you develop parsers which handle most, if not all, of
these cases.
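
To make the "regular expression pattern matcher sitting on top of a
line type recognizer" idea concrete, here is a minimal Python sketch.
It is not taken from Biopy, BioDB-Loader or Biopython.  The function
names, the ID pattern and the sample record are all invented for
illustration, and it assumes the usual SWISS-PROT layout of a
two-letter line code in the first two columns with data starting at
the sixth column.

```python
import re

# Assumption: the ID line looks like "NAME  CLASS;  MOLTYPE;  NNN AA."
# This pattern is illustrative only, not any project's real format definition.
ID_PATTERN = re.compile(
    r"(?P<name>\S+)\s+(?P<cls>\w+);\s+(?P<mol>\w+);\s+(?P<length>\d+) AA\."
)

def recognize(line):
    """Line type recognizer: split a line into (line code, data)."""
    return line[:2], line[5:].rstrip()

def parse_record(lines):
    """Group data by line type, then apply a regular expression to the ID data."""
    fields = {}
    for line in lines:
        code, data = recognize(line)
        if code == "//":  # end-of-record marker
            break
        fields.setdefault(code, []).append(data)
    match = ID_PATTERN.match(fields["ID"][0])
    id_info = match.groupdict() if match else None
    return fields, id_info

# Made-up record, not a real SWISS-PROT entry.
SAMPLE_RECORD = [
    "ID   TEST_HUMAN     STANDARD;      PRT;   104 AA.",
    "AC   X00000;",
    "DE   HYPOTHETICAL TEST PROTEIN.",
    "//",
]

fields, id_info = parse_record(SAMPLE_RECORD)
```

Note that anything the patterns do not capture (spacing, line breaks)
is thrown away here, which is exactly the formatting loss mentioned
above.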

I want to make sure I covered everything so I've been searching
for SWISS-PROT parsers as my prototypical example.  A
description of what I found is below.  If something major is
missing, please tell me.  If you can provide assistance with
the SRS, GCG, Java or Lisp parts, also please tell me.


 Here's a key to some of the notation I use in the listings below:
count == count the number of records in a database
offset == generate offsets into the file for fast indexing
   (listed as "index" in most of the entries below)
fasta == extract data for FASTA (ID, AC and SQ fields)
generic == extract generic sequence data, usually as a
   data structure containing fields common to multiple formats
   but ignoring some SWISS-PROT specific fields
all == extract all fields
validate == validate that a record is in the correct format
markup == identifies fields and saves the layout data so as
    to allow HTML markup without otherwise changing the format
    (timings not given for markup since it will depend on the
     specific markup requested, and because only Martel and
     get-sprot-entry preserve markup)

Performance is measured against the 80,000 records of
swissprot38 (abbreviated "sprot38" in the timings below).
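
As a rough illustration of what the count, offset and fasta tasks in
the key involve, here is a small Python sketch.  It is not the code
used for any of the timings below.  The function names and the sample
data are invented, and it assumes that records start with an "ID"
line, end with "//", and that sequence lines follow the "SQ" line.

```python
import io

def count_records(handle):
    """count: one record per line that starts with 'ID'."""
    return sum(1 for line in handle if line.startswith(b"ID"))

def record_offsets(handle):
    """offset: byte position of each 'ID' line, much like 'grep -b ^ID'."""
    offsets, position = [], 0
    for line in handle:
        if line.startswith(b"ID"):
            offsets.append(position)
        position += len(line)
    return offsets

def to_fasta(handle):
    """fasta: yield (id, accession, sequence) from the ID, AC and SQ fields."""
    name = accession = None
    seq, in_seq = [], False
    for raw in handle:
        line = raw.decode("ascii").rstrip("\n")
        if line.startswith("ID"):
            name = line[5:].split()[0]
        elif line.startswith("AC") and accession is None:
            accession = line[5:].split(";")[0]  # first accession only
        elif line.startswith("SQ"):
            in_seq = True
        elif line.startswith("//"):
            yield name, accession, "".join(seq)
            name = accession = None
            seq, in_seq = [], False
        elif in_seq:
            seq.append(line.replace(" ", ""))

# Two made-up records, not real SWISS-PROT entries.
SAMPLE = (
    b"ID   TEST1_HUMAN    STANDARD;      PRT;    10 AA.\n"
    b"AC   X00001;\n"
    b"SQ   SEQUENCE   10 AA;  1111 MW;  00000000 CRC32;\n"
    b"     MKTAYIAKQR\n"
    b"//\n"
    b"ID   TEST2_HUMAN    STANDARD;      PRT;     5 AA.\n"
    b"AC   X00002;\n"
    b"SQ   SEQUENCE   5 AA;  555 MW;  00000000 CRC32;\n"
    b"     MKTAY\n"
    b"//\n"
)

n_records = count_records(io.BytesIO(SAMPLE))
offsets = record_offsets(io.BytesIO(SAMPLE))
entries = list(to_fasta(io.BytesIO(SAMPLE)))
```

The count and offset functions never look past the first two bytes of
a line, which is why those tasks can be so much faster than "all".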


grep - http://www.gnu.org/gnulist/production/grep.html
  written in C
  count (when used as "grep ^ID | wc")
     takes 0m:57s to parse sprot38
  offset (when used as "grep -b ^ID")
  cannot be used for fasta, generic, all, validate, markup

one really large regular expression  (here as a bit of humor)
  written in C
  cannot be used for count, offset, fasta, generic, all, markup
  can be used for validate in theory, but I haven't tested it

bioperl - http://www.bioperl.org/
  written in Perl
  count (as a special case of generic)
  fasta (as a special case of generic)
  generic
    takes 30m:13s to parse sprot38
  cannot be used for index (?), all, validate, markup

biopython - http://www.biopython.org/
  written in Python
  count (as a special case of all)
  fasta (as a special case of all)
  generic (as a special case of all)
  all
    takes 28m:55s to parse sprot38
  validate
  cannot be used for index(?), markup

biojava - http://www.biojava.org/
  written in Java
  unknown (have source but need to figure it out)
  performance unknown (don't know how to code in Java)

Martel - http://www.biopython.org/~dalke/Martel/
  written in Python with a C extension
  count
    RecordReader.StartsWith "ID" takes 1m:28s to parse sprot38
  index
  fasta (standard format def. but only using the ID and SQ tags)
    takes 9m:23s to parse sprot38
  generic (as a special case of all)
  all
    takes 23m:29s to parse sprot38
  validate
    with no callbacks takes 6m:41s
  markup
  

SRS - http://www.lionbio.co.uk/
  written in C (?)
  have never used it, but it can definitely do count, fasta,
  generic and all.  The standard swissprot format definition
       http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?
       -page+LibInfo+-id+01FXMii+-lib+SWISSPROT
  cannot be used to validate although SRS itself can.  I
  think SRS can be used to generate HTML markup but I can't
  begin to guess how that might be done.
    *** I really want to ask someone questions about SRS ***
    *** Any takers? ***
  I don't think it can be used to create your own indices -
    you must use its offset tables.

swissknife - ftp://ftp.ebi.ac.uk/pub/software/swissprot/
  written in Perl
  count
    lazy reader takes 1m:48s to parse sprot38
  fasta (getting the ->ID and ->SQ attributes)
    takes 8m:47s to parse sprot38
  generic (as a special case of all)
  all
    takes 38m:21s to parse sprot38
  cannot be used to validate, markup

Biopy - http://shag.embl-heidelberg.de:8000/Biopy/
  written in Python
  count (as a special case of all)
  index (by "position += length($_)")
  fasta (as a special case of all)
  generic (as a special case of all)
  all - requires additional programming to parse the subfields
    (it only identifies lines) so I actually wouldn't count
    this as a full parser.
    * takes roughly 25m to parse
  cannot be used to validate, markup

Darwin - http://cbrg.inf.ethz.ch/Darwinshome.html
  is its own language and set of libraries
  contains a converter from SWISS-PROT to its own format.
  I don't have access to the source code so the following is based
  on the example parser at
    http://www.inf.ethz.ch/personal/hallett/drive/node92.html
  count (as a special case of all)
  fasta (as a special case of all)
  generic (as a special case of all)
  all - requires additional programming to parse the subfields
    although the real implementation may contain all of that.
  given example cannot be used to index, validate, markup
(Why does http://www.inf.ethz.ch/personal/hallett/drive/drive.html
say that SWISS-PROT 38 has only 77,977 records when my copy has
exactly 80,000?)

SeqIO - http://www.cs.ucdavis.edu/~gusfield/seqio.tar.gz
  written in C
  count (as a special case of generic)
  fasta (as a special case of generic)
  generic
    have not yet benchmarked
  cannot be used to index, all, validate, markup

readseq (C) - http://iubio.bio.indiana.edu/soft/molbio/readseq/
                 version1/readseq.shar
  written in C
  doesn't have swissprot, so I need to test whether embl works instead
  to be tested  

readseq (Java) - http://iubio.bio.indiana.edu/soft/molbio/readseq/
                    java/readseq-source.zip
  written in Java
  have not yet explored (see above where I need help on how
   to write a good test program in Java.)

Boulder - http://stein.cshl.org/software/boulder/
  written in Perl
  count (as a special case of generic)
  fasta (as a special case of generic)
  generic
    have not yet benchmarked
  cannot be used for index, all, validate, markup

molbio++ - ftp://ftp.ebi.ac.uk/pub/software/unix/molbio.tar.Z
  written in (now obsolete) C++ which doesn't compile
  I think it can be classified as
  count (as a special case of generic)
  fasta (as a special case of generic)
  generic, although it calls for some extra parsing to get
     at subfields of a data line
     * will not be benchmarking since I don't want to spend
        the effort to get it to compile.
  cannot be used for index, all, validate, markup

BioDB-Loader - http://www.franz.com/services/conferences_seminars/
                 ismb2000/biodb1.tar.Z

  written in Common Lisp (Help! I know even less lisp than Java!)
  I'm guessing it can be classified as
  count (as a special case of generic)
  index
  fasta (as a special case of generic)
  generic, although it calls for some extra parsing to get
     at the subfields of a data line
     * have not benchmarked, although I have downloaded the Allegro
        common Lisp demo version.
  cannot be used for all, validate, markup

GCG - http://www.gcg.com/products/wis-package.html
  written in C (?)
  never used it.  Betting it can be classified as
  count (as a special case of generic)
  index
  fasta (as a special case of generic)
  generic
    have not benchmarked since I'm not spending that much
    money just to test the performance.
  cannot be used for all, validate, markup

sp2fasta - part of ftp://ftp.ncbi.nlm.nih.gov/toolkit/ ?
  Can't seem to find it in the current distribution.  Various
  web pages imply it is a C program to convert SWISS-PROT/EMBL
  to FASTA.
  count (if used together with grep and wc)
  fasta
    have not benchmarked since I cannot find code
  cannot be used for index, generic, all, validate, markup

sw2xml - http://www.vsms.nottingham.ac.uk/biodom/software/
             protsuite-user-dist/sw2xml-protbot.pl
  written in Perl.  It is a translation program from SWISS-PROT
  to XML so some additional, though minor, XML coding is needed
  to do the following.
  count (as a special case of all)
  fasta (as a special case of all)
  generic (as a special case of all)
  all
    have not yet benchmarked
  cannot be used to index, validate, markup (because of the 'tidy')

NiceProt - used at ExPASy
  implementation information not available
  only used to parse a single record
  parses the data file but doesn't build a data structure (?)
  so creation of fasta, generic and all requires some modifications.
  cannot be used to count, index, validate(?), markup

get-sprot-entry - used at ExPASy
  implementation not available
  can be used to markup a record (eg, see
    http://expasy.cbr.nrc.ca/cgi-bin/get-sprot-entry?P52930 )
  doesn't build data structures or convert to another format
    so it cannot be used for anything else (true?)


Whew!  I'd be surprised if I really did miss some other
major style of parsing.  Actually, I did - there are no
lex/yacc grammars for SWISS-PROT, but I'm not surprised
because the lexing is strongly position dependent which
calls for tight, explicit, tricky communications with the
parser.
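
A tiny Python sketch of what "position dependent" means here.  This is
only an illustration with invented function names and an invented
description line, and it assumes the line code sits in the first two
columns with data starting at the sixth.  A lexer that ignores column
position mistakes ordinary text for a line code, while a
position-aware one does not.

```python
import re

def naive_codes(line):
    """Position-blind: any standalone pair of capitals looks like a line code."""
    return re.findall(r"\b[A-Z]{2}\b", line)

def positional_tokens(line):
    """Position-aware: only columns 0-1 are a line code; the rest is data."""
    return [("CODE", line[:2]), ("DATA", line[5:].rstrip())]

LINE = "DE   ID CARD PROTEIN."  # invented description line

blind = naive_codes(LINE)        # finds "ID" inside the description text
aware = positional_tokens(LINE)  # one code, one data token
```

In a lex/yacc setup the lexer would need to be told, line by line,
which of these modes it is in, and that is the tricky communication
with the parser.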

Any other suggestions?

Sincerely,

                    Andrew Dalke
                    dalke@acm.org