[Open-bio-l] Common SQLite3 schema for flat file indexing (a new OBDA standard)?

Tue Apr 13 06:26:40 EDT 2010

Hello all,

Last year we had a brief discussion about the Open Biological
Database Access (OBDA) indexing for "flat files" which BioPerl and
BioRuby at least still support (despite some confusion over the spec):
http://lists.open-bio.org/pipermail/open-bio-l/2009-August/000561.html

There may still be life in the current Berkeley DB (BDB) based OBDA
index, but with the ever larger sequence files used in next generation
sequencing this could be a problem. Is anyone finding problems with
the current BDB index scaling to larger files (with tens of millions of
entries to index)?

From the Biopython perspective, we have a small external incentive
to favour SQLite3 over BDB: The Python standard library has
historically included a Berkeley DB module (bsddb), but it has been
deprecated since Python 2.6 and removed in Python 3. On the other hand,
all recent versions of Python include SQLite3 support.

Those of you on the BioPerl or Biopython mailing lists will have heard
me mention the idea of using SQLite to hold a flat file index, e.g.
http://lists.open-bio.org/pipermail/bioperl-l/2010-April/032713.html
http://lists.open-bio.org/pipermail/biopython/2009-December/005997.html

From the BioPerl thread I know they are now looking at using SQLite to hold
a lookup table of file offsets. Are any of the other Bio* projects interested
in this approach? I'd ideally like us to agree on something shared by all the
Bio* libraries (a new OBDA standard using SQLite3 instead of BDB). I was
thinking something along these lines if we want to support an index for
multiple files - just three tables:

* meta - table with string key/values (in particular to hold a schema version
number, plus perhaps the tool which built the index)

* offsets - table with entry accessions, file number (key to next table),
file offset

* files - table with filenames, file type (e.g. FASTA), datestamp
(so we can spot if the index is older than the file and needs to be
updated), perhaps other things like if the file is compressed (gzip,
bz2, ...).
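
As a rough illustration of the three tables above, here is a minimal sketch
using Python's sqlite3 module. The column names, the "schema_version" value
and the create_index helper are all assumptions made up for this example,
not an agreed standard:

```python
import sqlite3

# Hypothetical schema: the three table names follow the proposal above,
# but every column name here is an assumption for illustration only.
SCHEMA = """
CREATE TABLE meta (
    key   TEXT PRIMARY KEY,
    value TEXT
);
CREATE TABLE files (
    file_number INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    format      TEXT,      -- e.g. 'fasta'
    mtime       REAL,      -- datestamp, to spot a stale index
    compression TEXT       -- e.g. 'gzip', 'bz2', or NULL
);
CREATE TABLE offsets (
    accession   TEXT PRIMARY KEY,
    file_number INTEGER NOT NULL REFERENCES files(file_number),
    file_offset INTEGER NOT NULL
);
"""


def create_index(path=":memory:"):
    """Create an empty index database and record a schema version."""
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    con.execute("INSERT INTO meta VALUES (?, ?)", ("schema_version", "0.1"))
    con.commit()
    return con


con = create_index()
con.execute(
    "INSERT INTO files VALUES (?, ?, ?, ?, ?)",
    (0, "example.fasta", "fasta", 0.0, None),
)
con.execute("INSERT INTO offsets VALUES (?, ?, ?)", ("ABC123", 0, 1234))

# Look up which file holds an accession, and where in that file it starts.
row = con.execute(
    "SELECT f.name, o.file_offset FROM offsets AS o "
    "JOIN files AS f ON f.file_number = o.file_number "
    "WHERE o.accession = ?",
    ("ABC123",),
).fetchone()
```

Making the accession the primary key gives a fast lookup, but it also
assumes accessions are unique across all the indexed files, which a real
standard would need to decide on.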

Of course, there are complications. For instance, calculating the offsets
when dealing with different file encodings and new lines. Mark Schreiber
raised this as a concern with Java (see open-bio-l thread linked to above,
email dated 2 Sept 2009). The new line issue could also affect Biopython,
but this may not be a real issue in practice unless moving indexes between
operating systems.
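
To make the new line concern concrete, here is a small Python sketch (the
fasta_offsets function is a hypothetical helper, not taken from any Bio*
library). It records raw byte offsets by reading in binary mode, and shows
that the same two records get different offsets once the line endings
change from Unix (LF) to DOS/Windows (CRLF):

```python
import io


def fasta_offsets(handle):
    """Yield (identifier, byte offset) for each '>' record.

    The handle must be opened in binary mode so that tell() reports
    raw byte positions with no new line translation applied.
    """
    while True:
        start = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            # Identifier is the first word after '>' (assumes it is present).
            identifier = line[1:].split(None, 1)[0].decode("ascii")
            yield identifier, start


unix = b">a\nACGT\n>b\nGG\n"
dos = unix.replace(b"\n", b"\r\n")

unix_offsets = dict(fasta_offsets(io.BytesIO(unix)))  # {'a': 0, 'b': 8}
dos_offsets = dict(fasta_offsets(io.BytesIO(dos)))    # {'a': 0, 'b': 10}
```

An index built against one copy of the file would therefore point at the
wrong bytes in the other, which is why an index is only safe to share if
the underlying file is byte-for-byte identical.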

Regards,

Peter
(@Biopython)