[Biopython] Indexing large sequence files

Peter biopython at maubp.freeserve.co.uk
Thu Jun 18 12:04:04 UTC 2009


On Wed, Jun 17, 2009 at 6:37 PM, Cedar McKay wrote:
> Hello, I depend on functionality provided by Fasta.index_file to index a
> large file (5 million sequences), too large to put in memory, and access it
> in a dictionary-like way. Newer versions of Biopython have removed (or
> hopefully moved) this functionality....

Hi again Cedar,

I've changed the subject line as I wanted to take this opportunity to ask
more about the background to your use case.

Do you only care about FASTA files? Might you also want to index
say a UniProt/SwissProt file, a large GenBank file, or a big FASTQ
file?

Presumably you need random access to the file (and can't simply use
a for loop to work through it record by record).
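
In case it helps frame things, the simple streaming approach I mean
looks like this (example.fasta is just a placeholder filename):

from Bio import SeqIO

# Stream over the file one record at a time; memory use stays low,
# but you cannot jump straight to an arbitrary record.
handle = open("example.fasta")
for record in SeqIO.parse(handle, "fasta"):
    print("%s length %i" % (record.id, len(record.seq)))
handle.close()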

Do you care about the time taken to build the index, the time to access
a record, or both?

Do you expect to actually use most of the records, or just a small fraction?

[This has important implications for the implementation, as it is
possible to avoid parsing the data into objects while indexing.]
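
Something like this rough sketch (untested, plain Python - build_index
and fetch are just names I'm making up here, not the old
Fasta.index_file code). The indexer records only the byte offset of
each record, keyed on its identifier, and parsing into a SeqRecord is
deferred until a record is actually fetched:

from Bio import SeqIO

def build_index(filename):
    # One pass over the file, looking only at the ">" header lines -
    # no SeqRecord objects are built here, just id -> byte offset.
    index = {}
    handle = open(filename)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(">"):
            # Take the first word after ">" as the record identifier
            index[line[1:].split(None, 1)[0]] = offset
    handle.close()
    return index

def fetch(filename, index, record_id):
    # Seek to the stored offset and parse just that one record.
    handle = open(filename)
    handle.seek(index[record_id])
    record = next(SeqIO.parse(handle, "fasta"))
    handle.close()
    return record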

I personally did once use the Fasta.index_file function (several years
ago now) for ~5000 sequences. I found that rebuilding the indexes as
my dataset changed was a big hassle, and eventually switched to
in-memory dictionaries. I was able to do this because the dataset
wasn't too big, and for that project it was a much more sensible
approach.
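
For completeness, Bio.SeqIO can build such an in-memory dictionary in
one line via SeqIO.to_dict (again, example.fasta and seq1 are just
placeholders):

from Bio import SeqIO

handle = open("example.fasta")
# Build a dict of id -> SeqRecord in one go; simple and convenient,
# but only sensible when all the records fit in RAM.
records = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
handle.close()
print(records["seq1"].seq)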

Peter


