[Biopython-dev] New Bio.SeqIO code

Michiel de Hoon mdehoon at c2b2.columbia.edu
Wed Nov 1 05:58:41 UTC 2006


Peter (BioPython Dev) wrote:
> With such a speed up, I'd guess you were using Bio.Fasta before.

Yes I was. I just went to the Biopython tutorial and used the stuff in
section 2.4. I didn't expect it to be *that* slow.

> I've noticed the same thing.  Are you dealing with NCBI style fasta 
> identifiers made up of several fields separated by "|" characters?

Yep.

>> For duplicate keys, there are at least four possibilities (raise an
>> exception, store only one of the keys, store neither of the keys
>> and don't raise an exception, store both after modifying one of the
>> keys). So this should also be an option.
> 
> Supporting all these options with an easy to understand interface
> looks too hard.
> 
> In my opinion if someone is trying to build a dictionary using
> repeated keys they have made a mistake (either in their datafile, or
> their record2key function) - so raising an exception is reasonable
> default behaviour (and is easy to code).

You're probably right. I'm fine with raising an exception.
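
A minimal sketch of that default, assuming a hypothetical helper named records_to_dict and a stand-in Record type (neither is existing Bio.SeqIO code). The record2key argument is the one discussed above, and the "|"-separated identifiers are invented for illustration:

```python
from collections import namedtuple

# Stand-in for a parsed record; a real SeqRecord carries more fields.
Record = namedtuple("Record", ["id", "seq"])


def records_to_dict(records, record2key=lambda record: record.id):
    """Build {key: record}; a repeated key is treated as an error."""
    answer = {}
    for record in records:
        key = record2key(record)
        if key in answer:
            # Either the data file or record2key is at fault.
            raise ValueError("Duplicate key %r" % (key,))
        answer[key] = record
    return answer


# Invented NCBI-style identifiers: fields separated by "|".
records = [Record("gi|1|ref|A", "ACGT"), Record("gi|2|ref|B", "TTGA")]

# Key each record on the second field of its identifier.
by_gi = records_to_dict(records, record2key=lambda r: r.id.split("|")[1])
```

Passing the same record twice, or a record2key that maps two records to the same string, would raise ValueError rather than silently dropping one of them.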

>> In the File2SequenceDict above, answer[key] contains the complete 
>> record. Some people will want that. However, in my application I
>> only want to store the record.seq part in answer[key]. Somebody
>> else may want str(record.seq). So we'd also need a record2value
>> argument.
> 
> It does slightly undermine the "you only get SeqRecord objects" 
> principle.  On the other hand, it's a simple addition that is easy to
> explain and implement.  I'm happy to add this.

The point I was trying to make is that for a File2SequenceDict function 
to be useful, it would end up being too complex. With the dictionary 
above, a user could also write answer[key].seq to get the part she 
wants, so maybe a record2value argument is not essential in practice.
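
To illustrate, here is a small sketch with a stand-in Record type and made-up data (not real Biopython objects). It shows the caller getting the sequence part from a dictionary of complete records, either per lookup or in one pass:

```python
from collections import namedtuple

# Stand-in for a parsed record; only the fields used here.
Record = namedtuple("Record", ["id", "seq"])

# A dictionary holding complete records, as File2SequenceDict would return.
answer = {
    "alpha": Record("alpha", "ACGT"),
    "beta": Record("beta", "GGCC"),
}

# Per lookup: take the part you want when you need it.
seq = answer["alpha"].seq

# Or convert the whole dictionary once, without a record2value argument.
seqs = dict((key, record.seq) for key, record in answer.items())
```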

Part of my opposition to the File2SequenceDict function is that it 
requires the parser to be called File2SequenceIterator (which I don't 
like as a name, but more about that some other time), which then leads 
to a File2SequenceList function, which is software bloat.

So, how about making the functionality of File2SequenceDict available as 
a todict() method on the iterator object returned by 
File2SequenceIterator, or as an iterator2dict function?
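
A rough sketch of the two alternatives, under stated assumptions: the SequenceIterator class and the Record type are hypothetical stand-ins, and only the names todict and iterator2dict come from the proposal above. Here the method simply delegates to the function, so the two share one implementation:

```python
from collections import namedtuple

# Stand-in for a parsed record; only the fields used here.
Record = namedtuple("Record", ["id", "seq"])


def iterator2dict(iterator, record2key=lambda record: record.id):
    """Standalone function: consume any record iterator into a dict."""
    answer = {}
    for record in iterator:
        key = record2key(record)
        if key in answer:
            raise ValueError("Duplicate key %r" % (key,))
        answer[key] = record
    return answer


class SequenceIterator(object):
    """Hypothetical iterator object, as a parser might return."""

    def __init__(self, records):
        self._records = iter(records)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._records)

    def todict(self, record2key=lambda record: record.id):
        """Method form: same behaviour, attached to the iterator."""
        return iterator2dict(self, record2key)


records = [Record("alpha", "ACGT"), Record("beta", "GGCC")]

from_method = SequenceIterator(records).todict()
from_function = iterator2dict(iter(records))
```

The function form works on any iterable of records, including a plain list, while the method form is only available on objects of that iterator class.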

--Michiel.


