[Biopython-dev] Indexing (large) sequence files with Bio.SeqIO

Tue Sep 1 09:56:26 EDT 2009

On Tue, Sep 1, 2009 at 2:06 PM, Brad Chapman<chapmanb at 50mail.com> wrote:
>
>Peter wrote:
>> Using BioSQL in this way is a much more general tool than
>> simply "indexing a sequence file". It feels like a sledgehammer
>> to crack a nut. Also, do you expect it to scale well for 10 million
>> plus short reads? It may do, but on the other hand it may not.
>
> Agreed that it would introduce extra overhead for something like
> short reads. If you are talking about serializing SeqRecords, it
> would make sense to re-use what we have in BioSQL.

I wasn't talking about serialising SeqRecord objects. I agree
there is (almost) no point implementing new serialisation code
when we already have BioSQL.

> If you are talking about storing just file offsets, then a lightweight
> solution makes more sense.

Indeed.
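
For illustration only, here is a minimal sketch of the lightweight
"just file offsets" idea for a FASTA file. The function names
build_offset_index and fetch_raw_record are hypothetical, not existing
Bio.SeqIO code, and the sketch assumes ASCII data and that the record
ID is the first word after the '>' character.

```python
def build_offset_index(path):
    """Map each FASTA record ID to the byte offset of its '>' line.

    Hypothetical helper: assumes the ID is the first word after '>'.
    """
    index = {}
    # Binary mode so that tell() reports true byte offsets.
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                words = line[1:].split()
                if words:
                    index[words[0].decode("ascii")] = offset
    return index


def fetch_raw_record(path, offset):
    """Return the raw text of the record that starts at the given offset."""
    lines = []
    with open(path, "rb") as handle:
        handle.seek(offset)
        lines.append(handle.readline())  # the '>' title line
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            lines.append(line)
    return b"".join(lines).decode("ascii")
```

Only the ID to offset mapping is held in memory. The raw text returned
by fetch_raw_record could then be passed to an ordinary parser to build
a SeqRecord on demand.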

> For me, the initial parse time to prepare an index is not as much
> of an issue since it happens once while queries on it will happen
> multiple times.

It depends on the expected workload - if you are thinking about
indexing a local copy of GenBank, but only expect to pull out a
few (hundred) records, then the index time may be longer than
the total access time.

But in general, if we are talking about saving the index to a file
(which can then be reloaded) I would agree, the up front cost to
prepare the index isn't critical.

On the subject of how to store an index of file offsets on disk,
I think the old Biopython Martel/Mindy indexing code used to
create OBDA style indexes (either simple flat files or BDB based).
We should certainly consider these for cross project compatibility,
or perhaps introduce a new OBDA version which might use
something like SQLite internally instead?
http://lists.open-bio.org/pipermail/open-bio-l/2009-August/000561.html
http://lists.open-bio.org/pipermail/open-bio-l/2009-September/000567.html
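
To make the SQLite idea a little more concrete, here is a rough sketch
of saving an offset index to disk and querying it later. The table
name, column names and both function names are invented for
illustration. This is not the OBDA layout, just one way a
record ID to file offset mapping might be stored using the sqlite3
module from the Python standard library.

```python
import sqlite3


def save_offsets(db_path, offsets):
    """Store a {record_id: byte_offset} mapping in an SQLite file.

    Hypothetical schema, not an OBDA format.
    """
    con = sqlite3.connect(db_path)
    try:
        with con:  # commit on success, roll back on error
            con.execute(
                "CREATE TABLE IF NOT EXISTS record_offsets "
                "(record_id TEXT PRIMARY KEY, byte_offset INTEGER NOT NULL)"
            )
            con.executemany(
                "INSERT OR REPLACE INTO record_offsets VALUES (?, ?)",
                list(offsets.items()),
            )
    finally:
        con.close()


def lookup_offset(db_path, record_id):
    """Return the stored byte offset for record_id, or None if absent."""
    con = sqlite3.connect(db_path)
    try:
        row = con.execute(
            "SELECT byte_offset FROM record_offsets WHERE record_id = ?",
            (record_id,),
        ).fetchone()
    finally:
        con.close()
    return None if row is None else row[0]
```

With something like this the index is built once, and later sessions
only need the database file to find a record's offset without
re-scanning the sequence file.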

Peter