[Biopython-dev] Order preservation in SeqIO/SearchIO indexes?

Peter Cock p.j.a.cock at googlemail.com
Mon Feb 6 15:44:23 UTC 2017


Hello all,

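One of the interesting changes in Python 3.6 is that the built-in
dictionary preserves key insertion order (in CPython 3.6 this is an
implementation detail rather than a language guarantee). The same
behaviour can be achieved in older versions of Python by explicitly
using the OrderedDict class, available as: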

from collections import OrderedDict

Biopython 1.69 will take advantage of this for the SeqRecord
feature qualifiers (i.e. annotation), allowing more faithful round
trip input and output for GenBank and EMBL formats:

https://github.com/biopython/biopython/commit/c1f93f3b870b48c3483724abb1b045967feaae84
https://github.com/biopython/biopython/pull/987

In a related change, I am proposing we use the OrderedDict
for the SeqIO and SearchIO functions to_dict and index, which
currently use the default Python dictionary implementation.

One question is whether this should be a configurable change,
for example with an extra optional argument allowing the
user to specify the dictionary class to be used, or just a
straightforward change with a note in the documentation?
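
As a rough sketch of what the configurable option might look like:
the dict_class argument below is hypothetical and is not part of the
current SeqIO.to_dict signature, and a namedtuple stands in for
SeqRecord so the example runs without Biopython installed.

```python
from collections import OrderedDict, namedtuple

# Stand-in for SeqRecord (assumption, for illustration only).
Record = namedtuple("Record", ["id", "seq"])


def _default_key(record):
    """Use the record identifier as the dictionary key."""
    return record.id


def to_dict(records, key_function=None, dict_class=OrderedDict):
    """Hypothetical to_dict variant; dict_class is an assumed new argument."""
    if key_function is None:
        key_function = _default_key
    result = dict_class()
    for record in records:
        key = key_function(record)
        if key in result:
            raise ValueError("Duplicate key '%s'" % key)
        result[key] = record
    return result


records = [Record("gamma", "ACGT"), Record("alpha", "GGCC"), Record("beta", "TTAA")]
ordered = to_dict(records)                 # keys follow the input order
plain = to_dict(records, dict_class=dict)  # caller opts for the built-in dict
```

Defaulting dict_class to OrderedDict would give the "straightforward
change" behaviour while still letting callers opt out.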

It would then follow that the related index_db function also
be updated to return entries in the original order - which can
be done by sorting by the file offset. That may require adding a
new index to the database, which has performance implications.
For example, running the new code on an old index file could be slow.
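
The offset-sorting idea can be sketched with the standard sqlite3
module. The table and column names below are a simplified assumption
made for illustration, not necessarily the exact schema used by
index_db, and offset_index is a hypothetical name for the extra index.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Assumed, simplified layout: one row per record with its file and offset.
con.execute(
    "CREATE TABLE offset_data "
    "(key TEXT, file_number INTEGER, offset INTEGER, length INTEGER)"
)
con.execute("CREATE UNIQUE INDEX key_index ON offset_data (key)")

# Records in the order they appear in the file (keys are not sorted).
rows = [("gamma", 0, 0, 120), ("alpha", 0, 120, 95), ("beta", 0, 215, 130)]
con.executemany("INSERT INTO offset_data VALUES (?, ?, ?, ?)", rows)

# Hypothetical extra index so ORDER BY offset need not sort the table
# on every iteration; an old index file would lack it until upgraded.
con.execute(
    "CREATE INDEX IF NOT EXISTS offset_index "
    "ON offset_data (file_number, offset)"
)

keys_by_key = [
    row[0] for row in con.execute("SELECT key FROM offset_data ORDER BY key")
]
keys_in_file_order = [
    row[0]
    for row in con.execute(
        "SELECT key FROM offset_data ORDER BY file_number, offset"
    )
]
con.close()
```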

By preserving the record order as in the file, this kind of code
would see a major speed increase (by avoiding jumping back
and forth through the file with seek calls based on the arbitrary
ordering of the dictionary key hashes):

from Bio import SeqIO

my_index = SeqIO.index(...)  # or index_db(...)
for key in my_index:
    record = my_index[key]
    # do stuff

Or,

my_index = SeqIO.index(...)  # or index_db(...)
for key, record in my_index.items():
    pass  # do stuff
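
To illustrate why file order helps, here is a toy model (not a
benchmark) that counts how far the file position has to jump when
records are read in file order versus an arbitrary key order. The
offsets and the fixed record length are made-up values.

```python
from collections import OrderedDict

RECORD_LENGTH = 1000  # assumed fixed record size in bytes, for simplicity

# key -> start offset, listed in the order the records appear in the file
offsets = OrderedDict(
    [("rec1", 0), ("rec2", 1000), ("rec3", 2000), ("rec4", 3000)]
)


def total_seek_distance(keys, offsets, length=RECORD_LENGTH):
    """Sum the jumps needed to reach each record's start before reading it."""
    position = 0
    distance = 0
    for key in keys:
        start = offsets[key]
        distance += abs(start - position)
        position = start + length  # reading leaves us at the record's end
    return distance


file_order = list(offsets)
arbitrary_order = ["rec3", "rec1", "rec4", "rec2"]  # e.g. hash-based order

sequential_cost = total_seek_distance(file_order, offsets)
arbitrary_cost = total_seek_distance(arbitrary_order, offsets)
```

In file order every read starts where the previous one ended, so no
jumps are needed; the arbitrary order jumps back and forth each time.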

Thoughts/comments?

Peter

