[Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite

Wed Jun 9 12:28:20 UTC 2010

What about passing a class instance instead? This would make it very 
easy to swap the index storage system.

For example, to use a dictionary:

Bio.SeqIO.index(keyval_map=dict())

The minimal requirement for the 'keyval_map' instance passed in would be 
to implement the methods __getitem__(self, key) and __setitem__(self, key, 
value), following the "duck typing" approach commonly found in Python.
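
To make the duck typing point concrete, here is a rough sketch. The helper 
build_offset_index() is hypothetical (it is not Biopython code, and the 
'keyval_map' argument itself is only a proposal in this thread). It assumes 
a FASTA file opened in binary mode and only ever calls 
keyval_map[key] = value, so any object with __setitem__ would do:

```python
import io


def build_offset_index(handle, keyval_map):
    """Record the byte offset of each FASTA record in keyval_map.

    Hypothetical helper for illustration only. The sole requirement on
    keyval_map is that it supports keyval_map[key] = value.
    """
    while True:
        offset = handle.tell()  # position of the line about to be read
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(b">"):
            # The record identifier is the first word after ">"
            key = line[1:].split(None, 1)[0].decode("ascii")
            keyval_map[key] = offset
    return keyval_map


# Toy in-memory "file" with two records
data = io.BytesIO(b">seq1 first\nACGT\n>seq2 second\nGGCC\nTTAA\n")
index = build_offset_index(data, dict())

# Jump straight to the second record using the stored offset
data.seek(index["seq2"])
header = data.readline()
```

Swapping dict() for any other object with the same two methods should not 
require touching the indexing loop at all.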

An SQLite-based index would be a matter of having a class such as:

class KeyValSQLite(object):
   def __init__(self, filename):
       # create the database into file "filename"
       pass

   def __getitem__(self, key):
       """ return the value """
       # select whatever in something where key='<key>'...
       pass

   def __setitem__(self, key, value):
       # update...
       pass

Then this would be a call like:

Bio.SeqIO.index(keyval_map=KeyValSQLite("myindex.db"))
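
For what it is worth, the skeleton above could be filled in along these 
lines with the standard library sqlite3 module. This is only a sketch under 
my own assumptions: the table name "offsets", its two columns, and the 
close() method are made up for illustration and are not the schema used on 
any Biopython branch.

```python
import sqlite3


class KeyValSQLite(object):
    """Minimal key -> value store backed by SQLite (illustrative sketch)."""

    def __init__(self, filename):
        # Create (or reopen) the database in file "filename"
        self._con = sqlite3.connect(filename)
        self._con.execute(
            "CREATE TABLE IF NOT EXISTS offsets "
            "(key TEXT PRIMARY KEY, value INTEGER)"
        )
        self._con.commit()

    def __getitem__(self, key):
        """Return the value stored for key, or raise KeyError like a dict."""
        row = self._con.execute(
            "SELECT value FROM offsets WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        # Insert a new row, or overwrite the existing row for this key
        self._con.execute(
            "INSERT OR REPLACE INTO offsets (key, value) VALUES (?, ?)",
            (key, value),
        )
        self._con.commit()  # simple but slow: one commit per item

    def close(self):
        self._con.close()


# ":memory:" keeps this example off the disk; a real index would use a path
index = KeyValSQLite(":memory:")
index["seq1"] = 0
index["seq2"] = 17
index["seq2"] = 42  # overwrites the earlier value
```

Committing on every __setitem__ keeps the sketch short but will be slow for 
large files; committing in batches, as suggested in the message quoted 
below, would be the obvious next step.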

Now that you have the idea, getting a custom index based on BDB or 
anything should be a breeze...

L.

On 08/06/10 08:39, biopython-dev-request at lists.open-bio.org wrote:
> Hi all,
>
> Thanks for the lively discussion on the main list,
>
> http://lists.open-bio.org/pipermail/biopython/2010-June/006546.html
> ...
> http://lists.open-bio.org/pipermail/biopython/2010-June/006580.html
>
> I've spent the afternoon updating my old branch which uses SQLite
> to store the record identifier to file offset mapping. Using the code
> on this branch, Bio.SeqIO.index() supports a new optional argument
> currently called "db" (other names I like include "cache", suggestions
> welcome):
>
> http://github.com/peterjc/biopython/tree/index-sqlite
>
> The default (False) is not to use SQLite, but continue with an in
> memory Python dictionary. As long as you have enough RAM
> and don't plan to use the index at a later date, this will be fastest.
>
> If set to True or a filename, then an SQLite index is used to hold
> the offsets. This means very low RAM requirements, but is a lot
> slower because the offsets are written to disk and the SQLite
> index is updated as we go. I expect this part can be optimised
> (e.g. try to build the index at the end, try committing in batches).
>
> I'm still testing this, but the core of the work is done I think.
> Once we're happy with the public API, we can concentrate
> on things like the SQLite schema, and optimising the code.
>
> Peter
>
> P.S. I know it will need a little work to fail gracefully on Python 2.4
> when SQLite isn't installed.
>