[Biopython-dev] Order preservation in SeqIO/SearchIO indexes?

Peter Cock p.j.a.cock at googlemail.com
Mon Feb 6 15:44:23 UTC 2017

Hello all,

One of the interesting changes in Python 3.6 is that the built-in
dictionary now preserves key insertion order (in CPython 3.6 this is
an implementation detail rather than a language guarantee). The same
behaviour is available in older versions of Python by explicitly
using the OrderedDict class:

from collections import OrderedDict
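
As a small illustration of the behaviour in question, here is a
sketch using made-up qualifier keys and values (they are not taken
from any real GenBank file). The keys come back in the order they
were added, whichever Python version is used:

```python
from collections import OrderedDict

# Hypothetical qualifier keys, added in the order they might appear
# in a feature table.
qualifiers = OrderedDict()
qualifiers["gene"] = ["thrA"]
qualifiers["locus_tag"] = ["b0002"]
qualifiers["product"] = ["aspartokinase"]

# Iteration follows insertion order, not hash order.
key_order = list(qualifiers)
print(key_order)
```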

Biopython 1.69 will take advantage of this for the SeqRecord
feature qualifiers (i.e. annotation), allowing more faithful round
trip input and output for GenBank and EMBL formats:


In a related change, I am proposing we use the OrderedDict
for the SeqIO and SearchIO functions to_dict and index, which
currently use the default Python dictionary implementation.

One question is whether this should be a configurable change,
for example with an extra optional argument to allow the
user to specify the dictionary class to be used, or just a
straightforward change with a note in the documentation.
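
To make the configurable option concrete, here is a rough sketch of
what such an argument might look like. The function name
to_dict_sketch and the dict_class argument are hypothetical and are
not part of the current SeqIO API; plain tuples stand in for
SeqRecord objects so the example runs without Biopython:

```python
from collections import OrderedDict


def to_dict_sketch(records, key_function, dict_class=OrderedDict):
    """Build a key -> record mapping, rejecting duplicate keys.

    dict_class is a hypothetical argument naming the mapping type.
    """
    result = dict_class()
    for record in records:
        key = key_function(record)
        if key in result:
            raise ValueError("Duplicate key %r" % key)
        result[key] = record
    return result


# (id, sequence) tuples stand in for SeqRecord objects here.
records = [("seq3", "ACGT"), ("seq1", "GGCC"), ("seq2", "TTAA")]
by_id = to_dict_sketch(records, key_function=lambda rec: rec[0])
print(list(by_id))
```

Passing dict_class=dict would give back the old behaviour, which is
the main argument for making it an option rather than a fixed change.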

It would then follow that the related index_db function should also
be updated to return entries in the original order, which can
be done by sorting by the file offset. That may require adding a
new index to the database, which has performance implications:
for example, running the new code on an old index file could be slow.
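
The idea of sorting by file offset could be sketched with the
standard sqlite3 module as below. The table and column names are a
simplified assumption for illustration and may not match the real
index_db schema, and the index name offset_order is made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Assumed, simplified schema: one row per record with its position.
con.execute(
    "CREATE TABLE offset_data "
    "(key TEXT, file_number INTEGER, offset INTEGER, length INTEGER)"
)
# Rows inserted in key order, which differs from the file order.
rows = [
    ("alpha", 0, 120, 95),
    ("beta", 0, 215, 140),
    ("gamma", 0, 0, 120),
]
con.executemany("INSERT INTO offset_data VALUES (?, ?, ?, ?)", rows)

# The extra index that an existing database file would not have.
con.execute(
    "CREATE INDEX IF NOT EXISTS offset_order "
    "ON offset_data (file_number, offset)"
)

keys_in_file_order = [
    row[0]
    for row in con.execute(
        "SELECT key FROM offset_data ORDER BY file_number, offset"
    )
]
print(keys_in_file_order)
con.close()
```

Without the extra index the ORDER BY still works, but the database
has to sort every row on each iteration, which is where the slowdown
on old index files would come from.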

By preserving the record order as in the file, this kind of code
would see a major speed increase (by avoiding jumping back
and forth through the file with seek calls based on the arbitrary
ordering of the dictionary key hashes):

my_index = SeqIO.index(...)  # or index_db(...)
for key in my_index:
    record = my_index[key]
    # do stuff


Or, equivalently:

my_index = SeqIO.index(...)  # or index_db(...)
for key, record in my_index.items():
    # do stuff
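
To give a feel for why key order matters here, this toy model adds
up how far the file position has to move when records are visited in
file order versus a shuffled order. The offsets are invented, the
shuffle only stands in for arbitrary hash ordering, and record
lengths are ignored, so treat it as a rough sketch rather than a
benchmark:

```python
import random

# Hypothetical index: key -> byte offset of each record.
offsets = {"rec%03d" % i: i * 1000 for i in range(100)}


def total_seek_distance(keys):
    """Sum the absolute jumps between consecutive record offsets."""
    position = 0
    distance = 0
    for key in keys:
        distance += abs(offsets[key] - position)
        position = offsets[key]
    return distance


file_order = sorted(offsets, key=offsets.get)
shuffled = list(file_order)
random.Random(0).shuffle(shuffled)  # stands in for hash ordering

in_order = total_seek_distance(file_order)
scrambled = total_seek_distance(shuffled)
print(in_order, scrambled)
```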


