[Biopython-dev] New Bio.SeqIO code

Mon Oct 30 03:42:44 UTC 2006

Peter wrote:
> There are at least two important questions: What to use as the 
> dictionary key (e.g. record.id) and how to deal with duplicate keys 
> (e.g. use first/last record with that id, or simply abort).
>
> Rewriting File2SequenceDict() to use a simple dict would give something 
> like this, where record2key is an optional user supplied function.
> 
> def File2SequenceDict(..., record2key=None) :
>     iterator = File2SequenceIterator(...)
>     if record2key is None : record2key = lambda record : record.id
>     answer = dict()
>     for record in iterator :
>         key = record2key(record)
>         assert key not in answer, "Duplicate key"
>         answer[key] = record
>     return answer
> 
> The record2key function is perhaps not needed - I was trying to make the 
> function flexible.  The duplicate key behaviour could also be an option.
> 
I am using File2SequenceIterator in one of my scripts (thanks for that, 
by the way: my script is a lot faster now. I didn't do a rigorous 
timing, but it's about a zillion times faster), and I convert the 
iterator to a dictionary using plain Python. If I were to use 
File2SequenceDict instead, I would need the record2key argument, because 
in my application I want only part of record.id as the key.

In the File2SequenceDict above, answer[key] contains the complete 
record. Some people will want that. However, in my application I only 
want to store the record.seq part in answer[key]. Somebody else may want 
str(record.seq). So we'd also need a record2value argument.
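
As a rough sketch of what I mean by the plain Python route: the Record 
class, the sample ids and the choice of the last "|"-separated field as 
the key are all made up for illustration. In a real script the records 
would come from File2SequenceIterator, and only their .id and .seq 
attributes are assumed here.

```python
from collections import namedtuple

# Hypothetical stand-in for the records yielded by File2SequenceIterator;
# only the .id and .seq attributes are assumed.
Record = namedtuple("Record", ["id", "seq"])

records = [
    Record("gi|123|ref|NM_0001", "ACGT"),
    Record("gi|456|ref|NM_0002", "GGCC"),
]

# Use only part of record.id as the key, and store only the sequence
# (as a string) rather than the complete record.
sequences = {}
for record in records:
    key = record.id.split("|")[-1]
    sequences[key] = str(record.seq)
```

Both the key and the stored value are decided in the loop body, so no 
record2key or record2value argument is needed.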

For duplicate keys, there are at least four possibilities (raise an 
exception, store only one of the records, silently store neither 
record, or store both after modifying one of the keys). So this should 
also be an option.
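
To make the possibilities listed above concrete, here is a minimal 
sketch of a helper that supports all of them. The function name 
build_dict, the on_duplicate argument and its policy names are 
hypothetical, not part of Bio.SeqIO, and the "_2", "_3" suffix is just 
one possible way to modify a key.

```python
def build_dict(pairs, on_duplicate="raise"):
    """Build a dict from (key, value) pairs under a duplicate-key policy.

    on_duplicate is one of "raise", "first", "last", "drop" or "rename"
    (hypothetical names chosen for this sketch). Keys must be strings
    for the "rename" policy.
    """
    answer = {}
    dropped = set()  # keys removed under the "drop" policy
    for key, value in pairs:
        if key in dropped:
            continue  # already seen twice; keep ignoring it
        if key not in answer:
            answer[key] = value
        elif on_duplicate == "raise":
            raise ValueError("Duplicate key: %r" % (key,))
        elif on_duplicate == "first":
            pass  # keep the record stored earlier
        elif on_duplicate == "last":
            answer[key] = value
        elif on_duplicate == "drop":
            del answer[key]
            dropped.add(key)
        elif on_duplicate == "rename":
            # Append a numeric suffix until the key is unused.
            n = 2
            while "%s_%d" % (key, n) in answer:
                n += 1
            answer["%s_%d" % (key, n)] = value
        else:
            raise ValueError("Unknown policy: %r" % (on_duplicate,))
    return answer
```

Even this small helper is already longer than the loop a user would 
write for the one policy they actually need.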

You'll end up with a File2SequenceDict function that is more complicated 
than the plain Python solution.

--Michiel.