[Open-bio-l] Common SQLite3 schema for flat file indexing (a new OBDA standard)?
biopython at maubp.freeserve.co.uk
Mon Jun 7 13:56:07 EDT 2010
On Tue, Apr 13, 2010 at 11:26 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hello all,
> Last year we had a brief discussion about the Open Biological
> Database Access (OBDA) indexing for "flat files" which BioPerl and
> BioRuby at least still support (despite some confusion over the spec):
> There may still be life in the current Berkeley DB (BDB) based OBDA
> index, but with the ever larger sequence files needed in next generation
> sequencing this could be a problem. Is anyone finding problems with
> the current BDB index scaling to larger files (with tens of millions of
> entries to index)?
> From the Biopython perspective, we have a small external incentive
> to favour SQLite3 over BDB: the Python standard library has
> historically included a Berkeley DB module (bsddb), but it was
> deprecated in Python 2.6 and removed in Python 3. On the other hand, all recent versions of
> Python include SQLite3 support.
> Those of you on the BioPerl or Biopython mailing lists will have heard
> me mention the idea of using SQLite to hold a flat file index, e.g.
> From the BioPerl thread I know they are now looking at using SQLite to hold
> a lookup table of file offsets. Are any of the other Bio* projects interested
> in this approach? I'd ideally like us to agree on something shared with all the
> Bio* libraries (a new OBDA standard using SQLite3 instead of BDB). I was
> thinking something along these lines if we want to support an index for
> multiple files - just three tables:
> * meta - table with string key/values (in particular to hold a schema version
> number, plus perhaps the tool which built the index)
> * offsets - table with entry accessions, file number (key to next table),
> file offset
> * files - table with filenames, file type (e.g. FASTA), datestamp
> (so we can spot if the index is older than the file and needs to be
> updated), perhaps other things like if the file is compressed (gzip,
> bz2, ...).
> Of course, there are complications. For instance, calculating the offsets
> when dealing with different file encodings and new lines. Mark Schreiber
> raised this as a concern with Java (see open-bio-l thread linked to above,
> email dated 2 Sept 2009). The new line issue could also affect Biopython,
> but this may not be a real issue in practice unless moving indexes between
> operating systems.
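For concreteness, the three-table layout quoted above might look roughly like the sketch below, using Python's built-in sqlite3 module. None of this is an agreed schema: the column names (key, value, file_number, file_type, datestamp, compression, accession, file_offset), the "schema_version" key, its "0.1" value, and the helper names create_index and lookup are all assumptions made up for illustration.

```python
import sqlite3

# Hypothetical table layout. The proposal only lists the kinds of fields each
# table would hold; every column name here is an assumption.
SCHEMA = """
CREATE TABLE meta (
    key   TEXT PRIMARY KEY,
    value TEXT
);
CREATE TABLE files (
    file_number INTEGER PRIMARY KEY,
    filename    TEXT NOT NULL,
    file_type   TEXT,      -- e.g. 'FASTA'
    datestamp   REAL,      -- e.g. file modification time when indexed
    compression TEXT       -- e.g. NULL, 'gzip', 'bz2'
);
CREATE TABLE offsets (
    accession   TEXT PRIMARY KEY,
    file_number INTEGER NOT NULL REFERENCES files(file_number),
    file_offset INTEGER NOT NULL
);
"""


def create_index(path=":memory:"):
    """Create an empty index database and record a schema version."""
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    # 'schema_version' and '0.1' are placeholder values, not a standard.
    con.execute(
        "INSERT INTO meta (key, value) VALUES (?, ?)",
        ("schema_version", "0.1"),
    )
    con.commit()
    return con


def lookup(con, accession):
    """Return (filename, file_offset) for an accession, or None if absent."""
    return con.execute(
        "SELECT f.filename, o.file_offset "
        "FROM offsets AS o JOIN files AS f ON f.file_number = o.file_number "
        "WHERE o.accession = ?",
        (accession,),
    ).fetchone()
```

Making the accession the primary key of the offsets table gives an indexed lookup for free, at the cost of rejecting duplicate accessions across files; whether that is the right trade-off is one of the things a shared specification would need to settle.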
We've been discussing this again on the Biopython mailing list, and
the plan to store offsets in an SQLite3 database seems quite popular.
In the short term I'm just aiming for indexing single files, but it does
seem likely that many people would find multi-file indexing useful.
What do the other Bio* projects think? Should we try to co-ordinate
a specification for a common SQLite3 file indexing schema?
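As a rough illustration of the single-file case, here is a minimal sketch that indexes one FASTA file by byte offset. It opens the file in binary mode so that offsets count raw bytes regardless of the line endings, which is one way to sidestep the new line concern raised earlier. The function names index_fasta and fetch_record, the two-column offsets table, and the choice of the first word after ">" as the accession are all assumptions for this example, not part of any agreed specification.

```python
import sqlite3


def index_fasta(fasta_path, con):
    """Record the byte offset of each '>' header line in a single-file index."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS offsets ("
        "accession TEXT PRIMARY KEY, file_offset INTEGER NOT NULL)"
    )
    rows = []
    offset = 0
    # Binary mode: len(line) is the true byte count, including \n or \r\n.
    with open(fasta_path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:  # skip headers with no identifier
                    rows.append((fields[0].decode("ascii"), offset))
            offset += len(line)
    con.executemany("INSERT INTO offsets VALUES (?, ?)", rows)
    con.commit()


def fetch_record(fasta_path, con, accession):
    """Seek to the stored offset and return the raw bytes of one record."""
    row = con.execute(
        "SELECT file_offset FROM offsets WHERE accession = ?", (accession,)
    ).fetchone()
    if row is None:
        raise KeyError(accession)
    with open(fasta_path, "rb") as handle:
        handle.seek(row[0])
        lines = [handle.readline()]  # the header line itself
        for line in handle:
            if line.startswith(b">"):  # start of the next record
                break
            lines.append(line)
    return b"".join(lines)
```

Because the offsets are raw byte positions, an index built this way stays valid as long as the file itself is copied byte for byte; it would break if the file were converted between Unix and Windows line endings after indexing.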