[Biopython-dev] New Bio.SeqIO code
Peter (BioPython Dev)
biopython-dev at maubp.freeserve.co.uk
Mon Oct 30 10:54:41 UTC 2006
Michiel de Hoon wrote:
> On a related note, I don't think we need the SequenceList and
> SequenceDict class. To make a list, one can do ...
I've updated the new code in Bio.SeqIO to remove SequenceDict and
SequenceList and use the standard dictionary and list instead.
Michiel de Hoon wrote:
> I am using File2SequenceIterator in one of my scripts (thanks by the way
> for that, my script is a lot faster now. I didn't do a rigorous timing,
> but it's about a zillion times faster), and convert the iterator to a
> dictionary using plain Python. If I were to use File2SequenceDict
> instead, I would need the record2key argument, because in my application
> I want only part of record.id as the key.
With such a speed up, I'd guess you were using Bio.Fasta before. I've
noticed the same thing. Are you dealing with NCBI style fasta
identifiers made up of several fields separated by "|" characters?
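
If the identifiers do look like that, pulling one field out to use as the key could be sketched roughly as below. The helper name, the default field position and the example identifier are all made up for illustration; none of them is part of Bio.SeqIO.

```python
def ncbi_field(identifier, index=3):
    """Return one "|"-separated field of an NCBI-style identifier.

    Hypothetical helper: index 3 is only assumed to hold the accession.
    """
    fields = identifier.split("|")
    return fields[index]


# Made-up identifier in the NCBI style, for illustration only.
example_id = "gi|12345|gb|AB000001.1|EXAMPLE"
accession = ncbi_field(example_id)
```
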
> In the File2SequenceDict above, answer[key] contains the complete
> record. Some people will want that. However, in my application I only
> want to store the record.seq part in answer[key]. Somebody else may want
> str(record.seq). So we'd also need a record2value argument.
It does slightly undermine the "you only get SeqRecord objects"
principle. On the other hand, it's a simple addition that is easy to
explain and implement. I'm happy to add this.
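
As a rough sketch of what the two optional arguments might look like, assuming the names record2key and record2value from the discussion: the Record type and the records_to_dict function here are stand-ins invented for the example, not the actual Bio.SeqIO code.

```python
from collections import namedtuple

# Hypothetical stand-in for a SeqRecord; only .id and .seq are modelled.
Record = namedtuple("Record", ["id", "seq"])


def records_to_dict(records, record2key=None, record2value=None):
    """Build a plain dict from an iterable of records.

    Both callbacks are optional. By default the key is record.id and
    the value is the whole record. This sketch does not check for
    duplicate keys; a later record silently replaces an earlier one.
    """
    if record2key is None:
        record2key = lambda record: record.id
    if record2value is None:
        record2value = lambda record: record
    answer = {}
    for record in records:
        answer[record2key(record)] = record2value(record)
    return answer


# Made-up records with NCBI-style identifiers.
records = [Record("gi|1|gb|AB000001.1|X", "ACGT"),
           Record("gi|2|gb|AB000002.1|Y", "GGCC")]

# Key on one field of the id, and keep only the sequence as the value.
seqs = records_to_dict(records,
                       record2key=lambda r: r.id.split("|")[3],
                       record2value=lambda r: r.seq)
```

With both arguments left out, the function would fall back to whole records keyed on record.id, which keeps the "you only get SeqRecord objects" behaviour as the default.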
> For duplicate keys, there are at least four possibilities (raise an
> exception, store only one of the keys, store neither of the keys and
> don't raise an exception, store both after modifying one of the keys).
> So this should also be an option.
Supporting all these options with an easy to understand interface looks
too hard.
In my opinion if someone is trying to build a dictionary using repeated
keys they have made a mistake (either in their datafile, or their
record2key function) - so raising an exception is a reasonable default
behaviour (and is easy to code).
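
A minimal sketch of that default, assuming a plain dictionary is being filled one record at a time. The function name add_unique and the choice of ValueError are assumptions made for the example.

```python
def add_unique(answer, key, value):
    """Add key/value to the dict, refusing to overwrite an existing key."""
    if key in answer:
        # Exception type chosen for illustration only.
        raise ValueError("Duplicate key %r" % (key,))
    answer[key] = value


answer = {}
add_unique(answer, "alpha", "ACGT")
try:
    # Second record reuses the key, so this should be rejected.
    add_unique(answer, "alpha", "GGCC")
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```
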
Apart from the "exception" option, which of these actions do you
generally find most appropriate?
> You'll end up with a File2SequenceDict function that is more complicated
> than the plain Python solution.
Yes. Trying to do everything would be bad: complicated to implement,
and probably complicated to use as well.
Peter