[Biopython-dev] Indexing (large) sequence files with Bio.SeqIO

Thu Aug 20 10:13:00 EDT 2009

On Thu, Aug 20, 2009 at 2:58 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> I just have two suggestions:
>
> Since indexed_dict returns a dictionary-like object, it may make sense
> for the _IndexedSeqFileDict to inherit from a dict.

We'd have to override methods like values() so that they raise a
NotImplementedError rather than loading every record into memory.
But yes, good point.
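
To make that concrete, here is a rough sketch of a dict subclass that keeps
only offsets in memory. The class name LazyRecordDict and the loader callback
are made up for illustration; this is not the actual _IndexedSeqFileDict code.

```python
class LazyRecordDict(dict):
    """Hypothetical dict subclass: stores key -> file offset, loads on demand."""

    def __init__(self, offsets, loader):
        # offsets: mapping of record id -> offset; loader: callable(offset) -> record
        super().__init__(offsets)
        self._loader = loader

    def __getitem__(self, key):
        # Look up the stored offset, then fetch just that one record.
        offset = super().__getitem__(key)
        return self._loader(offset)

    def values(self):
        # Returning every record at once would defeat the point of indexing.
        raise NotImplementedError("values() would load every record into memory")

    def items(self):
        raise NotImplementedError("items() would load every record into memory")


# Toy loader standing in for "seek to offset and parse one record".
demo = LazyRecordDict({"seq1": 0, "seq2": 57}, lambda offset: "record@%d" % offset)
print(sorted(demo.keys()))
print(demo["seq2"])
```

One catch with inheriting from dict: built-in methods such as get() and pop()
do not go through an overridden __getitem__, so they would need overriding
too, otherwise they hand back the raw offsets.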

> Another issue is whether we can fold indexed_dict and to_dict into one.
> Right now we have
>
> def to_dict(sequences, key_function=None) :
>
> def indexed_dict(filename, format, alphabet=None) :
>
> What if we have a single function "dictionary" that can take sequences, a
> handle, or a filename, and optionally the format, alphabet, key_function,
> and a parameter "indexed" that indicates if the file should be indexed or
> kept into memory? Or something like that.

I wondered about this, but there are a couple of important differences
between my file indexer and the existing to_dict function.

For the Bio.SeqIO.to_dict() function, the optional key_function argument
maps a SeqRecord to the desired index (by default the record's id is used).
Supporting a key_function for indexing files in the same way would mean
every single record in the file must be parsed into a SeqRecord while
building the index. This is possible, but would slow things down a lot,
and while I considered it, I don't like the idea at all. Instead each
format indexer essentially has a "mini parser" which just extracts
the id string, so indexing is much faster.
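
As an illustration of the "mini parser" idea, here is a minimal sketch for
FASTA only. The function name index_fasta_ids is hypothetical and this is not
the code in the branch; it just shows that the id and the offset can be
recorded without ever building a SeqRecord.

```python
def index_fasta_ids(filename):
    """Map each FASTA record id to the byte offset of its '>' line."""
    offsets = {}
    # Binary mode, so tell() gives plain byte offsets we can seek back to.
    with open(filename, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                # The id is the first whitespace-delimited word after '>'.
                fields = line[1:].split()
                record_id = fields[0].decode("ascii") if fields else ""
                if record_id in offsets:
                    raise ValueError("Duplicate key %r" % record_id)
                offsets[record_id] = offset
    return offsets
```

Sequence lines are skipped untouched, which is where the speed comes from.
The price is that the key is fixed to whatever the mini parser extracts, so
an arbitrary key_function on SeqRecord objects does not fit.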

Also, the to_dict function can be used on any iterable of SeqRecord
objects, not just records from a file. They could come from a list of
SeqRecords, or from a generator expression filtering the output of
Bio.SeqIO.parse(). Anything at all really.
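
For comparison, the in-memory approach boils down to something like the
sketch below. Record and to_dict_sketch are stand-ins I have made up so the
example runs without Biopython; they only mimic the
to_dict(sequences, key_function=None) signature quoted above.

```python
from collections import namedtuple

# Hypothetical stand-in for SeqRecord, for illustration only.
Record = namedtuple("Record", ["id", "seq"])


def to_dict_sketch(records, key_function=None):
    """Build a plain dict from any iterable of records (all held in memory)."""
    if key_function is None:
        key_function = lambda rec: rec.id  # default: use the record's id
    result = {}
    for rec in records:
        key = key_function(rec)
        if key in result:
            raise ValueError("Duplicate key %r" % key)
        result[key] = rec
    return result


records = [Record("gi|1|a", "ACGT"), Record("gi|2|b", "GG"), Record("gi|3|c", "TTTTT")]

# A generator expression works just as well as a list or a parser iterator.
long_only = (rec for rec in records if len(rec.seq) >= 4)
by_number = to_dict_sketch(long_only, key_function=lambda rec: rec.id.split("|")[1])
print(sorted(by_number))
```

Here the key_function costs nothing extra, because every record already
exists as an object before it is stored.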

Finally I had better explain my thoughts on indexing and handles versus
filenames. For the SeqIO (and AlignIO etc) parsers, any handle which
supports the basic read/readline/iteration functionality can be used.
For the indexed_dict() function as written, we need to keep the handle
open for as long as the dictionary is kept in memory. We also must have
a handle which supports seek and tell (so not, for example, a urllib
handle or a handle for a compressed file). Finally, the mode the file was
opened in can be important (e.g. SFF files are binary, so universal
newlines mode must not be used). So while indexed_dict could take a file
handle (instead of a filename) there are a lot of provisos. I felt just
taking a filename was the simplest solution here.
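
A small sketch of why taking a filename keeps things simple: the object opens
the file itself, so it controls the mode and the lifetime of the handle. The
class RawRecordReader and its read_at method are hypothetical names, not part
of the proposed API.

```python
class RawRecordReader:
    """Hypothetical helper that owns its file handle for its whole lifetime."""

    def __init__(self, filename):
        # Opening the file here guarantees binary mode (no newline translation).
        self._handle = open(filename, "rb")
        if not self._handle.seekable():
            self._handle.close()
            raise ValueError("Random access needs a handle with seek and tell")

    def read_at(self, offset, length):
        # Jump straight to a previously recorded offset and read raw bytes.
        self._handle.seek(offset)
        return self._handle.read(length)

    def close(self):
        # The handle must stay open until the caller is done with lookups.
        self._handle.close()
```

If the caller passed in a handle instead, each of these checks (seekable,
right mode, still open) would become the caller's problem.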

> Otherwise, the code looks really nice. Thanks!

Great - thanks for your comments.

Peter