[Biopython] Indexing large sequence files

Brad Chapman chapmanb at 50mail.com
Fri Jun 19 12:42:11 UTC 2009


Peter and Cedar;

> > So, for millions of records, I am going off the shelve/pickle
> > idea. Storing offsets in the original sequence file does seem
> > more practical here.

Agreed. Pickle isn't a good fit for this kind of problem; it doesn't
scale to millions of records.

> How does this following code work for you? It is all in memory,
> no index files on disk. I've been testing it on uniprot_sprot.fasta
> which has only 470369 records (this example takes about 8s),
> but the same approach also works on a FASTQ file with seven
> million records (taking about 1min). These times are to build
> the index, and access two records for testing.

I like this idea, and your approach of scanning the file to build the
index of offsets in memory, avoiding index files on disk entirely.
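
To make that concrete for the archives, here is a minimal sketch of
such an in-memory offset index (my own illustration rather than your
code; the class name and the readline()/tell() bookkeeping are my
assumptions):

from Bio import SeqIO

class FastaOffsetIndex(object):
    """Minimal in-memory offset index for a FASTA file.

    A single scan records the tell() position of every '>' header
    line; a lookup seeks back to that position and parses just the
    one record with Bio.SeqIO. Only the offsets are held in memory,
    never the sequences.
    """

    def __init__(self, filename):
        self._handle = open(filename)
        self._offsets = {}
        while True:
            offset = self._handle.tell()
            line = self._handle.readline()
            if not line:
                break
            if line.startswith(">"):
                # First whitespace-delimited word after '>' is the id
                record_id = line[1:].split(None, 1)[0]
                self._offsets[record_id] = offset

    def __getitem__(self, record_id):
        self._handle.seek(self._offsets[record_id])
        return next(SeqIO.parse(self._handle, "fasta"))

Calling tell() before every readline() is the slow part, but memory
use stays proportional to the number of record ids rather than to the
sequences themselves.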

As a longer-term file indexing strategy for any SeqIO-supported
format, what do we think about SQLite support for BioSQL? One of the
ideas we've talked about before is revamping the BioSQL internals to
use SQLAlchemy, which would give us SQLite for free. This adds a new
Biopython dependency on SQLAlchemy for BioSQL work, but it would
hopefully push much of the MySQL/PostgreSQL-specific code Peter and
Cymon currently maintain into SQLAlchemy's internals, so we wouldn't
have to maintain it ourselves.
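
As a rough illustration of why SQLAlchemy would give us SQLite
essentially for free: the same ORM code runs unchanged against
different backends, with only the connection URL differing. (This is
a toy stand-in written against a recent SQLAlchemy, not the real
BioSQL bioentry schema, and the file and connection names are made
up.)

from sqlalchemy import create_engine, Column, Integer, String, Text
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Bioentry(Base):
    # Illustrative subset only -- not the actual BioSQL table layout
    __tablename__ = "bioentry"
    bioentry_id = Column(Integer, primary_key=True)
    name = Column(String(40))
    description = Column(Text)

# Swapping backends is just a different URL; the ORM layer is
# unchanged.
engine = create_engine("sqlite:///seqs.db")  # local file, zero setup
# engine = create_engine("postgresql://user@host/biosql")  # or a server
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)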

Conceptually, I like this approach because it gradually introduces
users to real persistent storage. That way, if your problem grows from
"index a file" to "index a file and also store other specific
annotations," it's a small change in usage rather than a major switch.
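
To give a feel for how small that change in usage could be, here is a
sketch of the eventual workflow, assuming the SQLite driver we would
gain from the SQLAlchemy work (the driver name, database file and
namespace below are hypothetical):

from Bio import SeqIO
from BioSQL import BioSeqDatabase

# Hypothetical: an SQLite-backed BioSQL database in a local file
server = BioSeqDatabase.open_database(driver="sqlite3", db="seqs.db")
db = server.new_database("swissprot")

# Same SeqIO parsing as before; the records now land in real storage
db.load(SeqIO.parse("uniprot_sprot.fasta", "fasta"))
server.commit()

# Later lookups (and any extra annotations) come from the database
record = db.lookup(accession="...")  # placeholder accession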

This could be a target for hacking next weekend if people are
generally agreed that it's a good idea.

Brad


