[Biopython-dev] ANN: mindy-0.1

Andrew Dalke dalke at acm.org
Mon Mar 19 06:41:54 EST 2001

WARNING! First attempt at a generalized indexer for bioinformatics
formats using Martel.  All code experimental and subject to change
with even less notice than usual!

Code is available from

For the last few weeks I've been thinking about how to use Martel as
part of a generalized database indexer.  Martel of course does all the
required parsing, so it's a matter of converting the results into some
indexable format.

My first idea was to use the iterator interface to pull out a record,
then pass it to XSLT to convert it into an indexed form.

However, that proved slow because
  - my XSLTs were taking roughly a second to process a record
       (although that was to convert the record into a SProt-like
        data structure; i.e., convert all data into purely semantic XML)
  - I still don't know how to use XSLT very effectively
  - the iterator interface is built on top of the callback interface,
       so it is slower

Instead I acknowledged the common case where all needed fields are
strictly contained in a single element, and wrote a content handler
which lets you say something like

   The primary identifier is the content of the 'entry_name' element
   The record contains aliases, which are located in the 'ac_number'
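The "field strictly contained in an element" idea can be sketched with
Python's standard SAX interface.  This is an illustration, not mindy's
actual handler: the class name and constructor arguments are
hypothetical, though the element names match the SWISS-PROT example
above.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler

class FieldCapture(ContentHandler):
    """Capture the text of one identifier element and any alias elements."""
    def __init__(self, identifier_tag, alias_tags):
        ContentHandler.__init__(self)
        self.identifier_tag = identifier_tag
        self.alias_tags = alias_tags
        self.identifier = None
        self.aliases = []
        self._buffer = None   # non-None only inside an element we want

    def startElement(self, name, attrs):
        if name == self.identifier_tag or name in self.alias_tags:
            self._buffer = []

    def characters(self, text):
        if self._buffer is not None:
            self._buffer.append(text)

    def endElement(self, name):
        if self._buffer is None:
            return
        text = "".join(self._buffer)
        self._buffer = None
        if name == self.identifier_tag:
            self.identifier = text
        elif name in self.alias_tags:
            self.aliases.append(text)

handler = FieldCapture("entry_name", ["ac_number"])
parseString(b"<record><entry_name>100K_RAT</entry_name>"
            b"<ac_number>Q62671</ac_number></record>", handler)
print(handler.identifier, handler.aliases)   # -> 100K_RAT ['Q62671']
```

Because the fields are strictly contained, the handler never needs to
track nesting state beyond "am I inside a wanted element".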

I then made a command-line interface which works like

% mindy_index.py --format Martel.formats.swissprot38.format \
  --record-tag swissprot38_record --dbname swiss \
  --identifier entry_name --alias ac_number --progress 100\

% mindy_search --dbname swiss --identifier 100K_RAT
ID   100K_RAT       STANDARD;      PRT;   889 AA.
AC   Q62671;
DT   01-NOV-1997 (Rel. 35, Created)

The indexing system uses Robin Dunn's bsddb3 interface on top of the
Sleepycat Berkeley DB package.  You can get them from (respectively)
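The indexing idea in miniature: map identifiers and aliases to where a
record lives on disk.  mindy uses bsddb3/Berkeley DB; this sketch
substitutes the standard-library dbm module so it runs anywhere, and
the key layout (the "id:" / "alias:" prefixes, the tab-separated
location value) is my illustration, not mindy's actual format.

```python
import dbm
import os
import tempfile

dbdir = tempfile.mkdtemp()
path = os.path.join(dbdir, "swiss")

with dbm.open(path, "c") as db:
    # primary identifier -> "filename <tab> start offset <tab> length"
    db[b"id:100K_RAT"] = b"sprot38.dat\t0\t4096"
    # alias -> primary identifier
    db[b"alias:Q62671"] = b"100K_RAT"

with dbm.open(path, "r") as db:
    ident = db[b"alias:Q62671"]        # resolve the alias first
    print(db[b"id:" + ident].decode()) # then fetch the record location
```

Retrieval is then two key lookups followed by a seek into the original
flat file, which is why lookups stay fast regardless of database size.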


Indexing my copy of swissprot38, using
  entry_name as the primary identifier
  ac_number as an alias
took:
597.380u 145.080s 14:31.19 85.2%        0+0k 0+0io 55346pf+0w

so just under 15 minutes.  From previous timings, reading the database
and getting the id, ac and sequence fields takes about 9 minutes, so
the overhead specifically for indexing is roughly 6 minutes, or about
40% of the time.  This is likely due to my inexperience working with
BSDDB and can probably be reduced by a few minutes.

The final index data size is a bit over 10MB:

% ls -l .mindy_dbhome/
total 9052
-rw-r-----    1 dalke    users        8192 Mar 19 11:08 __db.001
-rw-r-----    1 dalke    users      270336 Mar 19 11:08 __db.002
-rw-r-----    1 dalke    users      319488 Mar 19 11:08 __db.003
-rw-r--r--    1 dalke    users    10645504 Mar 19 11:08 swiss

The lookup time is very fast.  The command-line test is actually
limited by Python's startup time.  I haven't tried timing the lookup
from inside Python itself.
% time env PYTHONPATH=/home/dalke/src python mindy_search.py \
    --dbname swiss --identifier YU13_MYCTU --show-record=0 > /dev/null
0.130u 0.020s 0:00.19 78.9%     0+0k 0+0io 596pf+0w

% time env PYTHONPATH=/home/dalke/src python mindy_search.py \
    --dbname swiss --identifier YU13_MYCTU --show-record=1 > /dev/null
0.130u 0.030s 0:00.19 84.2%     0+0k 0+0io 597pf+0w

For details on how to use the programs, run
  mindy_index.py --help
  mindy_search.py --help

The name "mindy" is derived from "Martel INDexer".


There is no attempt at normalization, so searches are case and
whitespace sensitive.  This is easy to fix for the common case of
"string.lower everything and toss all ignorable whitespace".  I just
haven't done it.
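The fix described above really is small.  A sketch, assuming the key
function is applied both at index time and at search time (the function
name is mine, not mindy's):

```python
def normalize_key(text):
    # "string.lower everything and toss all ignorable whitespace":
    # lowercase, strip leading/trailing space, collapse internal runs.
    return " ".join(text.lower().split())

print(normalize_key("  100K_RAT  "))                          # -> 100k_rat
print(normalize_key("Q62671") == normalize_key(" q62671 "))   # -> True
```

As long as the same function touches every key on both sides, searches
become case- and whitespace-insensitive without any change to the
stored records themselves.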

The XSLT and Python function caller indexers are not implemented.

Most things aren't documented.

Haven't tested support for dealing with multiple files.

Would working with compressed files be useful?  (Even if slower for
record retrieval?)
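On why compressed retrieval would be slower: gzip file objects in
Python do support seek(), but a backward seek re-decompresses from the
start of the file, so an index of uncompressed byte offsets still pays
a decompression cost on every fetch.  A sketch (file contents and the
offset are made up for illustration):

```python
import gzip
import os
import tempfile

gz_path = os.path.join(tempfile.mkdtemp(), "records.gz")
with gzip.open(gz_path, "wb") as f:
    f.write(b"ID   100K_RAT\n"
            b"ID   YU13_MYCTU\n")

# Pretend the index said: the second record starts at uncompressed
# offset 14 (the length of the first line).
with gzip.open(gz_path, "rb") as f:
    f.seek(14)                  # decompresses up to this offset
    print(f.read().decode())    # -> ID   YU13_MYCTU
```

So the trade-off is real: much smaller storage, but each random record
access costs a partial re-decompression.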

Would like to be able to add new files to a database.

Would like to remove/update files in a database.

Would like to spawn off multiple indexers to take advantage of
multiprocessor machines - perhaps one indexer per file?  BSDDB can
support this sort of interface.

Haven't done any performance tuning.  Indeed, this is my first use of
BSDDB.
Haven't tested the 'keywords' section.

Could add a simple query language....

...But then more general-purpose tools should be used (MySQL?)

What about categories, like:
  name/* for any name
  name/swissprot-id for a swissprot-id
  reference/title contains "sequence analysis"
  xref/embl/embl-id is U05038
Okay, those really need a real database, although DOM/XPATH can
  handle some of them.  Hmm, see eXist.sourceforge.net and no
  doubt others.
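The category-style queries above map naturally onto XPath, and the
standard-library ElementTree handles a useful subset of it.  The
element names here follow the hypothetical examples in the list, not
any real format definition:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<entry>
  <name><swissprot-id>100K_RAT</swissprot-id></name>
  <xref><embl><embl-id>U05038</embl-id></embl></xref>
</entry>""")

# name/* for any name
print([e.text for e in doc.findall("name/*")])       # -> ['100K_RAT']
# xref/embl/embl-id is U05038
print(doc.find("xref/embl/embl-id").text == "U05038")  # -> True
```

The "contains" query on reference titles is where this breaks down;
that kind of full-text predicate is what a real database (or an XML
database like eXist) is for.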

Bugs section is incomplete :)


                    dalke at acm.org
  I'm back from all my travels so I'll be catching up on
things (back email, bills, etc.) over the next few days.
Just thought you all would like to know if I end up sending
replies to old messages :)
