[Biopython-dev] New Bio.SeqIO code

Peter (BioPython Dev) biopython-dev at maubp.freeserve.co.uk
Mon Oct 30 10:54:41 UTC 2006


Michiel de Hoon wrote:
> On a related note, I don't think we need the SequenceList and
> SequenceDict class. To make a list, one can do ...

I've updated the new code in Bio.SeqIO to remove SequenceDict and 
SequenceList and use the standard dictionary and list instead.
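
As a rough sketch of what "use the standard dictionary and list" means
in practice: any iterator of records can be handed straight to list()
or dict().  The Record class and demo_iterator function below are
made-up stand-ins for illustration only, not part of Bio.SeqIO.

```python
from collections import namedtuple

# Hypothetical stand-in for a SeqRecord, with only the two attributes
# needed for this illustration.
Record = namedtuple("Record", ["id", "seq"])


def demo_iterator():
    """Made-up stand-in for a sequence iterator over a file."""
    yield Record("alpha", "ACGT")
    yield Record("beta", "GGCC")


# A plain list of records, in file order.
records_list = list(demo_iterator())

# A plain dictionary keyed on record.id.
records_dict = dict((record.id, record) for record in demo_iterator())
```

Nothing here needs a special SequenceList or SequenceDict class.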

Michiel de Hoon wrote:
> I am using File2SequenceIterator in one of my scripts (thanks by the way 
> for that, my script is a lot faster now. I didn't do a rigorous timing, 
> but it's about a zillion times faster), and convert the iterator to a 
> dictionary using plain Python. If I were to use File2SequenceDict 
> instead, I would need the record2key argument, because in my application 
> I want only part of record.id as the key.

With such a speed-up, I'd guess you were using Bio.Fasta before. I've 
noticed the same thing.  Are you dealing with NCBI-style FASTA 
identifiers made up of several fields separated by "|" characters?
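
If so, I imagine the key function would look something like the sketch
below.  This is only my guess at the use case: the function name, the
choice of the fourth field, and the example identifier are all invented
for illustration.

```python
def field_from_ncbi_id(identifier, index=3):
    """Return one "|"-separated field of an NCBI-style identifier.

    The default index of 3 is an assumption for this example; which
    field is wanted depends on the application.
    """
    fields = identifier.split("|")
    return fields[index]


# Invented identifier, used purely as an example.
example_id = "gi|12345|gb|XX000001.1|"
key = field_from_ncbi_id(example_id)
```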

> In the File2SequenceDict above, answer[key] contains the complete 
> record. Some people will want that. However, in my application I only 
> want to store the record.seq part in answer[key]. Somebody else may want 
> str(record.seq). So we'd also need a record2value argument.

It does slightly undermine the "you only get SeqRecord objects" 
principle.  On the other hand, it's a simple addition that is easy to 
explain and implement.  I'm happy to add this.
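
A minimal sketch of how the two optional arguments might fit together,
assuming both default to "use record.id" and "store the whole record".
The function name records_to_dict and the Record class are placeholders
for illustration, not the actual Bio.SeqIO code.

```python
from collections import namedtuple

# Hypothetical stand-in for a SeqRecord.
Record = namedtuple("Record", ["id", "seq"])


def records_to_dict(records, record2key=None, record2value=None):
    """Build a plain dict from an iterable of records.

    Both callables are optional.  Note this sketch does not check for
    duplicate keys; a later record silently replaces an earlier one.
    """
    if record2key is None:
        record2key = lambda record: record.id
    if record2value is None:
        record2value = lambda record: record
    return dict((record2key(r), record2value(r)) for r in records)


records = [Record("alpha", "ACGT"), Record("beta", "GGCC")]

# Default behaviour: the whole record is stored under record.id.
whole = records_to_dict(records)

# Store only the sequence, as in Michiel's use case.
only_seq = records_to_dict(records, record2value=lambda r: r.seq)
```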

> For duplicate keys, there are at least four possibilities (raise an 
> exception, store only one of the keys, store neither of the keys and 
> don't raise an exception, store both after modifying one of the keys). 
> So this should also be an option.

Supporting all these options with an easy-to-understand interface looks 
too hard.

In my opinion, if someone is trying to build a dictionary using repeated 
keys, they have made a mistake (either in their data file or in their 
record2key function), so raising an exception is a reasonable default 
behaviour (and is easy to code).
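
To show how little code the "exception" option takes, here is a sketch.
The names strict_dict and Record are hypothetical, and ValueError is
just my choice of exception for the example.

```python
from collections import namedtuple

# Hypothetical stand-in for a SeqRecord.
Record = namedtuple("Record", ["id", "seq"])


def strict_dict(records, record2key):
    """Build a dict of records, refusing to overwrite an existing key."""
    answer = {}
    for record in records:
        key = record2key(record)
        if key in answer:
            # A repeated key most likely means a problem in the data
            # file or in record2key, so stop rather than guess.
            raise ValueError("Duplicate key %r" % (key,))
        answer[key] = record
    return answer


unique = strict_dict(
    [Record("alpha", "ACGT"), Record("beta", "GGCC")],
    lambda record: record.id,
)
```

Anyone wanting one of the other behaviours can still loop over the
iterator and fill a dictionary by hand.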

Apart from the "exception" option, which of these actions do you 
generally find most appropriate?

> You'll end up with a File2SequenceDict function that is more complicated 
> than the plain Python solution.

Yes.  Trying to do everything would be bad: complicated to implement, 
and probably complicated to use as well.

Peter



