[Biopython] Indexing large sequence files

Fri Jun 19 09:03:40 EDT 2009

On Fri, Jun 19, 2009 at 1:42 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>> How does this following code work for you? It is all in memory,
>> no index files on disk. I've been testing it on uniprot_sprot.fasta
>> which has only 470369 records (this example takes about 8s),
>> but the same approach also works on a FASTQ file with seven
>> million records (taking about 1min). These times are to build
>> the index, and access two records for testing.
>
> I like this idea, and your algorithm to parse multiple times and
> avoid building an index at all.

Cool. It can be generalised as I said - I'm playing with an
implementation now. This approach wouldn't have been
such a good idea in the early days of Biopython as it is still
a bit memory hungry - but it seems to work fine for millions
of records.
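
For anyone following along without the earlier code to hand, here is
a rough sketch of one way such an in-memory lookup could work. It is
not the code under discussion: it assumes plain FASTA, assumes the
index is a dictionary mapping each record identifier to the file
offset of its ">" line, and the function names build_offset_index
and get_raw_record are made up for illustration.

```python
import io


def build_offset_index(handle):
    """Scan a binary FASTA handle once, mapping record ID -> byte offset.

    Illustrative only: the ID is taken to be the first word after '>'.
    """
    index = {}
    while True:
        offset = handle.tell()  # position of the line about to be read
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(b">"):
            fields = line[1:].split(None, 1)
            record_id = fields[0].decode() if fields else ""
            index[record_id] = offset
    return index


def get_raw_record(handle, index, record_id):
    """Seek to a stored offset and return that record's raw text."""
    handle.seek(index[record_id])
    lines = [handle.readline()]  # the '>' title line
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break  # reached the next record or end of file
        lines.append(line)
    return b"".join(lines).decode()


# Tiny demonstration using an in-memory file in place of a real one
demo = io.BytesIO(b">a desc\nACGT\nTT\n>b\nGG\n")
demo_index = build_offset_index(demo)
demo_record = get_raw_record(demo, demo_index, "b")
```

Only the identifiers and integer offsets are held in memory, not the
sequences themselves, which is why memory use grows with the number
of records rather than the size of the file.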

> As a longer term file indexing strategy for any type of SeqIO
> supported format, what do we think about SQLite support for
> BioSQL?

I like this idea - we'll have to sell it to Hilmar at BOSC 2009
next weekend as it would require another BioSQL schema.

> One of the ideas we've talked about before is revamping
> BioSQL internals to use SQLAlchemy, which would give us
> SQLite for free. This adds an additional Biopython dependency
> on SQLAlchemy for BioSQL work, but hopefully will move a lot
> of the MySQL/PostgreSQL specific work Peter and Cymon do
> into SQLAlchemy internals so we don't have to maintain it.

The Python SQLite wrapper sqlite3 should be DB-API 2.0
compliant, so we should be able to integrate it into our existing
BioSQL code fine. I see what you are getting at with the
SQLAlchemy thing but remain to be convinced. Let's talk about
this at BOSC 2009.
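
To make the DB-API point concrete, here is a minimal sketch using
only the standard library sqlite3 module. It does not use the BioSQL
schema: the table name "offsets", its columns, and the helper names
save_offsets and lookup_offset are all hypothetical, chosen just to
show the connect/cursor/execute/fetch pattern that DB-API 2.0 drivers
share.

```python
import sqlite3


def save_offsets(conn, offsets):
    """Store a {record_id: file_offset} dictionary in a toy table.

    The table layout is hypothetical and unrelated to BioSQL.
    """
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS offsets ("
        "record_id TEXT PRIMARY KEY, file_offset INTEGER NOT NULL)"
    )
    # sqlite3 uses '?' placeholders (paramstyle "qmark"); other DB-API
    # drivers may use a different style, so this SQL is not portable as is.
    cur.executemany(
        "INSERT INTO offsets (record_id, file_offset) VALUES (?, ?)",
        list(offsets.items()),
    )
    conn.commit()


def lookup_offset(conn, record_id):
    """Return the stored offset for record_id, or None if it is absent."""
    cur = conn.cursor()
    cur.execute(
        "SELECT file_offset FROM offsets WHERE record_id = ?", (record_id,)
    )
    row = cur.fetchone()
    return None if row is None else row[0]


# ":memory:" keeps the database in RAM; a filename would persist it
demo_conn = sqlite3.connect(":memory:")
save_offsets(demo_conn, {"a": 0, "b": 16})
demo_offset = lookup_offset(demo_conn, "b")
```

The one wrinkle worth noting is the placeholder style, which DB-API
2.0 lets each driver choose, so any shared SQL would need to account
for that.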

> Conceptually, I like this approach as it gradually introduces
> users to real persistent storage. This way if your problem moves
> from "index a file" to "index a file and also store other specific
> annotations," it's a small change in usage rather than a major
> switch.

You mean pushing BioSQL (perhaps with SQLite as the DB)
for indexing records? Sure - and as the sqlite3 module is
included with Python 2.5 and later, it could make BioSQL much
simpler to install and use with Biopython (at least if we don't
also need SQLAlchemy!)

> This could be a target for hacking next weekend if people are
> generally agreed that it's a good idea.

It is at the very least worth a good debate.

Peter