[Open-bio-l] Common SQLite3 schema for flat file indexing (a new OBDA standard)?

Mon Jun 7 15:08:27 EDT 2010

On Jun 7, 2010, at 12:56 PM, Peter wrote:

> On Tue, Apr 13, 2010 at 11:26 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Hello all,
>> 
>> Last year we had a brief discussion about the Open Biological
>> Database Access (OBDA) indexing for "flat files" which BioPerl and
>> BioRuby at least still support (despite some confusion over the spec):
>> http://lists.open-bio.org/pipermail/open-bio-l/2009-August/000561.html
>> 
>> There may still be life in the current Berkeley DB (BDB) based OBDA
>> index, but with the ever larger sequence files needed in next generation
>> sequencing this could be a problem. Is anyone finding problems with
>> the current BDB index scaling to larger files (with tens of millions of
>> entries to index)?
>> 
>> From the Biopython perspective, we have a small external incentive
>> to favour SQLite3 over BDB: the Python standard library has
>> historically included a Berkeley DB module (bsddb), but it was
>> deprecated in Python 2.6 and removed in Python 3. On the other hand,
>> all recent versions of Python include SQLite3 support.
>> 
>> Those of you on the BioPerl or Biopython mailing lists will have heard
>> me mention the idea of using SQLite to hold a flat file index, e.g.
>> http://lists.open-bio.org/pipermail/bioperl-l/2010-April/032713.html
>> http://lists.open-bio.org/pipermail/biopython/2009-December/005997.html
>> 
>> From the BioPerl thread I know they are now looking at using SQLite to hold
>> a lookup table of file offsets. Are any of the other Bio* projects interested
>> in this approach? I'd ideally like us to agree on something shared by all the
>> Bio* libraries (a new OBDA standard using SQLite3 instead of BDB). I was
>> thinking something along these lines if we want to support an index for
>> multiple files - just three tables:
>> 
>> * meta - table with string key/values (in particular to hold a schema version
>> number, plus perhaps the tool which built the index)
>> 
>> * offsets - table with entry accessions, file number (key to next table),
>> file offset
>> 
>> * files - table with filenames, file type (e.g. FASTA), datestamp
>> (so we can spot if the index is older than the file and needs to be
>> updated), perhaps other things like if the file is compressed (gzip,
>> bz2, ...).
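
A minimal sketch of the three-table layout described above, using Python's
standard sqlite3 module. The thread does not fix any table or column names,
so everything here (file_number, file_offset, mtime, compression, and the
helpers create_index and lookup) is an assumption made for illustration,
not an agreed schema.

```python
import sqlite3


def create_index(path=":memory:"):
    """Create an empty index with hypothetical meta/files/offsets tables."""
    con = sqlite3.connect(path)
    con.executescript(
        """
        -- String key/value pairs, e.g. schema version and indexing tool.
        CREATE TABLE meta (key TEXT PRIMARY KEY, value TEXT);

        -- One row per indexed flat file.
        CREATE TABLE files (
            file_number INTEGER PRIMARY KEY,
            filename    TEXT NOT NULL,
            format      TEXT,   -- e.g. 'fasta'
            mtime       REAL,   -- datestamp, to spot a stale index
            compression TEXT    -- e.g. 'gzip', 'bz2', or NULL
        );

        -- One row per entry: accession -> (file, byte offset).
        CREATE TABLE offsets (
            accession   TEXT PRIMARY KEY,
            file_number INTEGER NOT NULL REFERENCES files(file_number),
            file_offset INTEGER NOT NULL
        );
        """
    )
    con.execute("INSERT INTO meta VALUES ('schema_version', '1')")
    con.commit()
    return con


def lookup(con, accession):
    """Return (filename, file_offset) for an accession, or None if absent."""
    return con.execute(
        "SELECT f.filename, o.file_offset "
        "FROM offsets AS o JOIN files AS f USING (file_number) "
        "WHERE o.accession = ?",
        (accession,),
    ).fetchone()
```

With the accession as the primary key of the offsets table, a lookup is a
single indexed query, and comparing files.mtime with the file's current
modification time would be one way to detect an out-of-date index.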
>> 
>> Of course, there are complications. For instance, calculating the offsets
>> when dealing with different file encodings and new lines. Mark Schreiber
>> raised this as a concern with Java (see open-bio-l thread linked to above,
>> email dated 2 Sept 2009). The new line issue could also affect Biopython,
>> but this may not be a real issue in practice unless moving indexes between
>> operating systems.
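
To illustrate the new line concern, here is a small sketch that records
offsets as raw byte positions by reading in binary mode, so no newline or
encoding translation is applied. The function name fasta_offsets and the
choice of FASTA are assumptions for the example only.

```python
import io


def fasta_offsets(handle):
    """Yield (identifier, byte offset) for each '>' header line.

    The handle must be opened in binary mode so that tell() reports
    byte positions in the file as stored on disk.
    """
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            fields = line[1:].split(None, 1)
            identifier = fields[0].decode("ascii") if fields else ""
            yield identifier, offset


# The same two records with Unix and with DOS/Windows line endings.
unix_data = b">a\nACGT\n>b\nTT\n"
dos_data = b">a\r\nACGT\r\n>b\r\nTT\r\n"

unix_index = list(fasta_offsets(io.BytesIO(unix_data)))
dos_index = list(fasta_offsets(io.BytesIO(dos_data)))
```

The second record starts at a different byte offset in the two versions,
which is why an index would go wrong if the flat file had its line endings
converted after indexing, while the index itself stays valid as long as
the file is copied byte for byte.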
>> 
>> Regards,
>> 
>> Peter
>> (@Biopython)
> 
> Hi all,
> 
> We've been discussing this again on the Biopython mailing list, and
> the plan to store offsets in an SQLite3 database seems quite popular.
> In the short term I'm just aiming for indexing single files, but it does
> seem likely that many people would find multi-file indexing useful.
> What do the other Bio* projects think? Should we try to co-ordinate
> a specification for a common SQLite3 file indexing schema?
> 
> Regards,
> 
> Peter

We typically implement multi-file indexing in BioPerl (either via a directory or a list of files). There is not much point in limiting it to one file.

Have you looked at the OBDA standard? It is a good start along these lines, but I think it dwindled a bit.  Might be worth reworking and modernizing it to suit our needs.

chris