[Biopython-dev] New Bio.SeqIO code
Michiel de Hoon
mdehoon at c2b2.columbia.edu
Wed Nov 1 05:58:41 UTC 2006
Peter (BioPython Dev) wrote:
> With such a speed up, I'd guess you were using Bio.Fasta before.
Yes I was. I just went to the Biopython tutorial and used the stuff in
section 2.4. I didn't expect it to be *that* slow.
> I've noticed the same thing. Are you dealing with NCBI style fasta
> identifiers made up of several fields separated by "|" characters?
Yep.
>> For duplicate keys, there are at least four possibilities (raise an
>> exception, store only one of the keys, store neither of the keys
>> and don't raise an exception, store both after modifying one of the
>> keys). So this should also be an option.
>
> Supporting all these options with an easy to understand interface
> looks too hard.
>
> In my opinion if someone is trying to build a dictionary using
> repeated keys they have made a mistake (either in their datafile, or
> their record2key function) - so raising an exception is reasonable
> default behaviour (and is easy to code).
You're probably right. I'm fine with raising an exception.
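
A minimal sketch of the "raise an exception on duplicate keys" behaviour agreed on above. The Record type and the records_to_dict name are hypothetical stand-ins used only for illustration; they are not the actual Bio.SeqIO code, and the real records would be SeqRecord objects.

```python
from collections import namedtuple

# Hypothetical stand-in for a SeqRecord; only .id and .seq are used here.
Record = namedtuple("Record", ["id", "seq"])


def records_to_dict(records, record2key=lambda record: record.id):
    """Build a dict from an iterable of records, rejecting duplicate keys."""
    answer = {}
    for record in records:
        key = record2key(record)
        if key in answer:
            # A repeated key most likely signals a problem in the data file
            # or in record2key, so fail loudly instead of guessing.
            raise ValueError("Duplicate key %r" % (key,))
        answer[key] = record
    return answer


records = [Record("gi|1|ref|A", "ACGT"), Record("gi|2|ref|B", "TTGA")]

# Default: key on the full identifier.
by_id = records_to_dict(records)

# NCBI-style identifiers: key on the second "|"-separated field instead.
by_gi = records_to_dict(records, record2key=lambda r: r.id.split("|")[1])
```

Keying both records on the first "|" field ("gi") would raise ValueError here, since the two keys collide.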
>> In the File2SequenceDict above, answer[key] contains the complete
>> record. Some people will want that. However, in my application I
>> only want to store the record.seq part in answer[key]. Somebody
>> else may want str(record.seq). So we'd also need a record2value
>> argument.
>
> It does slightly undermine the "you only get SeqRecord objects"
> principle. On the other hand, it's a simple addition that is easy to
> explain and implement. I'm happy to add this.
The point I was trying to make is that for a File2SequenceDict function
to be useful, it would end up being too complex. In the answer above, a
user could also do answer[key].seq to get the part she wants, so maybe a
record2value argument is not essential in practice.
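
To illustrate why a record2value argument may not be essential: if the dictionary stores complete records, the caller can still pull out the part she wants afterwards. This is only a sketch; Record is a hypothetical stand-in for a SeqRecord, and the dictionary is built by hand rather than by File2SequenceDict.

```python
from collections import namedtuple

# Hypothetical stand-in for a SeqRecord.
Record = namedtuple("Record", ["id", "seq"])

# A dictionary of complete records, keyed on the identifier.
answer = {r.id: r for r in [Record("a", "ACGT"), Record("b", "TTGA")]}

# Option 1: take the sequence when it is needed.
seq_a = answer["a"].seq

# Option 2: derive a key -> sequence dictionary in one extra line,
# which is what a record2value argument would otherwise provide.
seqs = {key: record.seq for key, record in answer.items()}
```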
Part of my opposition to the File2SequenceDict function is that it
requires the parser to be called File2SequenceIterator (which I don't
like as a name, but more about that some other time), which then leads
to a File2SequenceList function, which is software bloat.
So, how about making the functionality of File2SequenceDict available as
a todict() method to the iterator object returned by
File2SequenceIterator, or as an iterator2dict function?
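
A rough sketch of the two alternatives proposed here, assuming duplicate keys raise an exception. The names todict and iterator2dict come from the proposal; RecordIterator and Record are hypothetical stand-ins for the object returned by File2SequenceIterator and for a SeqRecord, not existing Biopython classes.

```python
from collections import namedtuple

# Hypothetical stand-in for a SeqRecord.
Record = namedtuple("Record", ["id", "seq"])


def iterator2dict(iterator, record2key=lambda record: record.id):
    """Alternative 1: a standalone function that works on any record iterator."""
    answer = {}
    for record in iterator:
        key = record2key(record)
        if key in answer:
            raise ValueError("Duplicate key %r" % (key,))
        answer[key] = record
    return answer


class RecordIterator:
    """Hypothetical iterator object, standing in for a parser's return value."""

    def __init__(self, records):
        self._records = iter(records)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._records)

    def todict(self, record2key=lambda record: record.id):
        """Alternative 2: the same functionality as a method on the iterator."""
        return iterator2dict(self, record2key)


data = [Record("a", "ACGT"), Record("b", "TTGA")]
from_method = RecordIterator(data).todict()
from_function = iterator2dict(iter(data))
```

The standalone function works with any iterable of records, while the method form requires every parser to return an object that carries todict().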
--Michiel.