[Biopython-dev] New Bio.SeqIO code
Michiel de Hoon
mdehoon at c2b2.columbia.edu
Mon Oct 30 03:42:44 UTC 2006
Peter wrote:
> There are at least two important questions: What to use as the
> dictionary key (e.g. record.id) and how to deal with duplicate keys
> (e.g. use first/last record with that id, or simply abort).
>
> Rewriting File2SequenceDict() to use a simple dict would give something
> like this, where record2key is an optional user supplied function.
>
> def File2SequenceDict(..., record2key=None) :
>     iterator = File2SequenceIterator(...)
>     if record2key is None : record2key = lambda record : record.id
>     answer = dict()
>     for record in iterator :
>         key = record2key(record)
>         assert key not in answer, "Duplicate key"
>         answer[key] = record
>     return answer
>
> The record2key function is perhaps not needed - I was trying to make the
> function flexible. The duplicate key behaviour could also be an option.
>
I am using File2SequenceIterator in one of my scripts (thanks for that,
by the way; my script is a lot faster now. I didn't do a rigorous timing,
but it's about a zillion times faster), and I convert the iterator to a
dictionary using plain Python. If I were to use File2SequenceDict
instead, I would need the record2key argument, because in my application
I want only part of record.id as the key.
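For illustration, the plain Python conversion might look roughly like the
sketch below. The Record tuple, the sample ids, and the choice of the
second '|'-separated field as the key are all assumptions made up for this
example; in a real script the loop would run over File2SequenceIterator.

```python
from collections import namedtuple

# Hypothetical stand-in for the records yielded by File2SequenceIterator;
# only the .id and .seq attributes mentioned in this thread are modelled.
Record = namedtuple("Record", ["id", "seq"])


def iter_records():
    """Stand-in iterator; real code would iterate over a sequence file."""
    yield Record("gi|12345|ref|NM_000001", "ACGT")
    yield Record("gi|67890|ref|NM_000002", "GGCC")


# Use only part of record.id (here the second '|'-separated field) as key.
by_gi = {}
for record in iter_records():
    key = record.id.split("|")[1]
    by_gi[key] = record
```

Later duplicates silently overwrite earlier ones here, which is just the
default behaviour of dict assignment.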
In the File2SequenceDict above, answer[key] contains the complete
record. Some people will want that. However, in my application I only
want to store the record.seq part in answer[key]. Somebody else may want
str(record.seq). So we'd also need a record2value argument.
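The value side can be handled the same way in plain Python. A minimal
sketch, again using a made-up Record stand-in rather than real SeqIO
records:

```python
from collections import namedtuple

# Hypothetical stand-in record type with just .id and .seq.
Record = namedtuple("Record", ["id", "seq"])
records = [Record("seq1", "ACGT"), Record("seq2", "GGCC")]

# Keep the complete record, as the quoted File2SequenceDict does.
whole = dict((r.id, r) for r in records)

# Keep only the sequence, converted to a plain string.
seq_only = dict((r.id, str(r.seq)) for r in records)
```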
For duplicate keys, there are at least four possibilities: raise an
exception; store only one of the records; store neither record, without
raising an exception; or store both after modifying one of the keys.
So this should also be an option.
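To give an idea of how much code that adds, here is a rough sketch of a
helper covering the four cases. The name to_dict, the on_duplicate
argument, its policy names, and the "_2" renaming scheme are all
hypothetical; nothing like this is in the proposed Bio.SeqIO code. The
"rename" policy also assumes the keys are strings.

```python
def to_dict(records, key, value, on_duplicate="error"):
    """Build a dict from records; on_duplicate selects the policy.

    "error":  raise ValueError on the first duplicate key
    "first":  keep the first record seen for a key, ignore later ones
    "drop":   store none of the records that share a key
    "rename": store every record, appending _2, _3, ... to later keys
    """
    answer = {}
    dropped = set()  # keys removed under the "drop" policy
    for record in records:
        k = key(record)
        if k in dropped:
            continue
        if k not in answer:
            answer[k] = value(record)
        elif on_duplicate == "error":
            raise ValueError("Duplicate key: %r" % (k,))
        elif on_duplicate == "first":
            pass
        elif on_duplicate == "drop":
            del answer[k]
            dropped.add(k)
        elif on_duplicate == "rename":
            n = 2
            while "%s_%d" % (k, n) in answer:
                n += 1
            answer["%s_%d" % (k, n)] = value(record)
        else:
            raise ValueError("Unknown policy: %r" % (on_duplicate,))
    return answer


# Usage with (id, sequence) pairs standing in for real records.
pairs = [("a", "AAA"), ("b", "CCC"), ("a", "GGG")]
first = to_dict(pairs, key=lambda p: p[0], value=lambda p: p[1],
                on_duplicate="first")
```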
You'll end up with a File2SequenceDict function that is more complicated
than the plain Python solution.
--Michiel.