[Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
Laurent
lgautier at gmail.com
Wed Jun 9 12:28:20 UTC 2010
What about passing a class instance instead? This would let one change
the index storage system very easily.
For example, to use a dictionary:
    Bio.SeqIO.index(keyval_map=dict())
A minimal requirement for the instance 'keyval_map' passed would be to
implement the methods __getitem__(self, key) and __setitem__(self, key,
value), allowing the "duck typing" approach commonly found in Python.
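
To illustrate the duck typing point, here is a rough sketch. The function
name fill_index and the class TinyMap are made up for this example and are
not part of Biopython; the only thing assumed of 'keyval_map' is that it
supports item assignment and lookup.

```python
def fill_index(pairs, keyval_map):
    """Store (key, offset) pairs in any object supporting map[key] = value.

    Hypothetical helper for illustration only, not a Biopython function.
    """
    for key, offset in pairs:
        keyval_map[key] = offset
    return keyval_map


class TinyMap(object):
    """Minimal custom store: only __getitem__ and __setitem__ are defined."""

    def __init__(self):
        self._items = []

    def __getitem__(self, key):
        # Search newest entries first so a later assignment wins.
        for k, v in reversed(self._items):
            if k == key:
                return v
        raise KeyError(key)

    def __setitem__(self, key, value):
        self._items.append((key, value))


# The same function works with a plain dict or with the custom class.
offsets = [("seq1", 0), ("seq2", 57)]
as_dict = fill_index(offsets, dict())
as_tiny = fill_index(offsets, TinyMap())
```

Neither object inherits from the other; the two methods are all that the
calling code relies on.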
An SQLite-based index would be a matter of having a class such as:
class KeyValSQLite(object):

    def __init__(self, filename):
        # create the database into file "filename"
        pass

    def __getitem__(self, key):
        """ return the value """
        # select whatever in something where key='<key>'...
        pass

    def __setitem__(self, key, value):
        # update...
        pass
Then this would be a call like:
    Bio.SeqIO.index(keyval_map=KeyValSQLite("myindex.db"))
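
For what it is worth, a filled-in version of such a class could look
roughly like the sketch below. The table name "offsets" and the column
names "record_id" and "file_offset" are assumptions made up for this
example, not a proposed schema, and committing on every write is chosen
for simplicity rather than speed.

```python
import sqlite3


class KeyValSQLite(object):
    """Sketch of a key -> file offset map stored in an SQLite database."""

    def __init__(self, filename):
        # Open (or create) the database file; ":memory:" also works.
        self._con = sqlite3.connect(filename)
        # Table and column names are illustrative assumptions.
        self._con.execute(
            "CREATE TABLE IF NOT EXISTS offsets "
            "(record_id TEXT PRIMARY KEY, file_offset INTEGER)"
        )
        self._con.commit()

    def __getitem__(self, key):
        """Return the stored offset, or raise KeyError like a dict."""
        row = self._con.execute(
            "SELECT file_offset FROM offsets WHERE record_id = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        # INSERT OR REPLACE overwrites an existing key, as dict assignment does.
        self._con.execute(
            "INSERT OR REPLACE INTO offsets (record_id, file_offset) "
            "VALUES (?, ?)",
            (key, value),
        )
        self._con.commit()


index_map = KeyValSQLite(":memory:")
index_map["seq1"] = 0
index_map["seq2"] = 57
```

Looking up index_map["seq2"] then goes through a SELECT rather than a
dictionary lookup, but the calling code does not need to know that.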
Now that you have the idea, getting a custom index based on BDB or
anything should be a breeze...
L.
On 08/06/10 08:39, biopython-dev-request at lists.open-bio.org wrote:
> Hi all,
>
> Thanks for the lively discussion on the main list,
>
> http://lists.open-bio.org/pipermail/biopython/2010-June/006546.html
> ...
> http://lists.open-bio.org/pipermail/biopython/2010-June/006580.html
>
> I've spent the afternoon updating my old branch which uses SQLite
> to store the record identifier to file offset mapping. Using the code
> on this branch, Bio.SeqIO.index() supports a new optional argument
> currently called "db" (other names I like include "cache"; suggestions
> welcome):
>
> http://github.com/peterjc/biopython/tree/index-sqlite
>
> The default (False) is not to use SQLite, but continue with an in
> memory Python dictionary. As long as you have enough RAM
> and don't plan to use the index at a later date, this will be fastest.
>
> If set to True or a filename, then an SQLite index is used to hold
> the offsets. This means very low RAM requirements, but is a lot
> slower because the offsets are written to disk and the SQLite
> index is updated as we go. I expect this part can be optimised
> (e.g. try to build the index at the end, try committing in batches).
>
> I'm still testing this, but the core of the work is done I think.
> Once we're happy with the public API, we can concentrate
> on things like the SQLite schema, and optimising the code.
>
> Peter
>
> P.S. I know it will need a little work to fail gracefully on Python 2.4
> when SQLite isn't installed.
>