[Biopython-dev] Indexing (large) sequence files with Bio.SeqIO

Mon Aug 31 13:24:51 UTC 2009

Hi Peter;

> The Bio.SeqIO.indexed_dict() functionality is in CVS/github now
> as I would like some wider testing. My earlier email explained the
> implementation approach, and gave some example code:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006654.html

Sweet. I pulled this from your branch earlier for something I was
doing at work and it's great stuff. My only suggestion would be to
change the function name to make it clear it's an in-memory index.
That would leave room for similar file-based index functions later.
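
To make the in-memory idea concrete, here is a minimal sketch. It
assumes a plain FASTA file and keeps only a dict of record id to byte
offset in memory. It is not the Bio.SeqIO implementation;
build_offset_index and fetch are hypothetical names used only for
illustration.

```python
import io


def build_offset_index(handle):
    """Map each FASTA record id to the byte offset of its '>' line.

    Hypothetical helper: expects a handle opened in binary mode.
    """
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            # Use the first whitespace-delimited word after '>' as the key.
            words = line[1:].split(None, 1)
            if not words:
                raise ValueError("Record at offset %i has no id" % offset)
            key = words[0].decode("ascii")
            if key in index:
                raise ValueError("Duplicate key %r" % key)
            index[key] = offset
    return index


def fetch(handle, index, key):
    """Seek to one record and return (title, sequence) as text."""
    handle.seek(index[key])
    title = handle.readline()[1:].rstrip().decode("ascii")
    chunks = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # reached the start of the next record
        chunks.append(line.strip().decode("ascii"))
    return title, "".join(chunks)


# Tiny demo using an in-memory binary handle in place of a real file.
demo = io.BytesIO(b">seq1 first\nACGT\nAC\n>seq2 second\nGGCC\n")
offsets = build_offset_index(demo)
```

Only the keys and offsets live in memory; the sequences stay on disk
and are parsed on demand, which is why the name should say "in memory"
about the index rather than about the records.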

> Another option (like the shelve idea we talked about last month)
> is to parse the sequence file with SeqIO, and serialise all the
> SeqRecord objects to disk, e.g. with pickle or some key/value
> database. This is potentially very complex (e.g. arbitrary Python
> objects in the annotation), and could lead to a very large "index"
> file on disk. On the other hand, some possible back ends would
> allow editing the database... which could be very useful.

My thought here was to use BioSQL and the SQLite mappings for
serializing. We would build on an existing, tested serialization, and
also guide people into using BioSQL for larger projects.
Essentially, we would build an API on top of existing BioSQL
functionality that creates the index by loading the SQL schema and
then pushes the parsed records into it.
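
The "create the database, then push records into it" flow could look
roughly like the sketch below. It uses only the standard library
sqlite3 module and a single toy table as a stand-in; it is not the
BioSQL schema or the Biopython BioSQL API, and create_index,
load_records and get_record are hypothetical names.

```python
import sqlite3


def create_index(conn):
    """Create the toy table (stand-in for loading the real schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS record ("
        "id TEXT PRIMARY KEY, description TEXT, seq TEXT)"
    )


def load_records(conn, records):
    """Insert (id, description, seq) tuples in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO record (id, description, seq) VALUES (?, ?, ?)",
            records,
        )


def get_record(conn, key):
    """Dictionary-style lookup by id; raises KeyError if missing."""
    row = conn.execute(
        "SELECT id, description, seq FROM record WHERE id = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    return row


# Demo with an in-memory database; a file path would persist the index.
conn = sqlite3.connect(":memory:")
create_index(conn)
load_records(conn, [("seq1", "first", "ACGTAC"), ("seq2", "second", "GGCC")])
```

A real version would hand parsed SeqRecord objects to BioSQL instead
of flat tuples, so annotations and features get stored as well.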

> Brad - do you have any thoughts? I know you did some work
> with key/value indexers:
> http://bcbio.wordpress.com/2009/05/10/evaluating-key-value-and-document-stores-for-short-read-data/

I've been using MongoDB (http://www.mongodb.org/display/DOCS/Home)
extensively and it rocks; it's fast and scales well. The one piece of
work needed is translating objects into JSON representations. There
are object mappers like MongoKit (http://bitbucket.org/namlook/mongokit/)
that help with this.
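
As a rough illustration of that translation step, the sketch below
round-trips a small object through a JSON-friendly dict using only the
standard library. The Record class and both helper functions are
hypothetical and much simpler than a SeqRecord; the only MongoDB
convention assumed is that "_id" is the document's primary key. No
MongoDB or MongoKit calls are made.

```python
import json


class Record:
    """Hypothetical, simplified stand-in for a sequence record."""

    def __init__(self, id, seq, annotations=None):
        self.id = id
        self.seq = seq
        self.annotations = annotations or {}


def record_to_doc(record):
    """Flatten a Record into a dict of JSON-compatible values."""
    return {
        "_id": record.id,  # document stores such as MongoDB key on "_id"
        "seq": record.seq,
        "annotations": dict(record.annotations),
    }


def doc_to_record(doc):
    """Rebuild a Record from the dict produced by record_to_doc."""
    return Record(doc["_id"], doc["seq"], doc.get("annotations"))


rec = Record("seq1", "ACGT", {"organism": "E. coli"})
doc = record_to_doc(rec)
text = json.dumps(doc)  # fails loudly if a value is not JSON-compatible
restored = doc_to_record(json.loads(text))
```

The hard part in practice is annotations holding arbitrary Python
objects, which json.dumps rejects with a TypeError; an object mapper
is one way to declare up front what is allowed.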

Connecting these thoughts together, a rough two-step development plan
would be:

- Modify the underlying Biopython BioSQL representation to be
  object-based, using SQLAlchemy. This is essentially what I'd
  suggested as a building block from Kyle's implementation.

- Use this to provide object mappings for object-based stores, like
  MongoDB/MongoKit or Google App Engine.

Brad