[Biopython] Sequence object "find" is still case specific?

Sun Mar 3 22:03:29 UTC 2013

We're going off topic here, but for the record I think
that 'find' should continue to be case sensitive (like
Python strings).
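
For the record, here is a minimal sketch of the plain Python
string behaviour I mean. The sequence literal and variable names
are made up for illustration, and the case-insensitive search at
the end is simply something the caller can opt into, not anything
Biopython does for you:

```python
seq = "ACGTacgt"  # made-up example sequence with mixed case

# str.find is case sensitive: it returns the lowest index of an
# exact match, or -1 when the substring is not present.
upper_hit = seq.find("ACG")   # matches the upper case prefix
lower_hit = seq.find("acg")   # matches only the lower case half
missing = seq.find("AcG")     # mixed case occurs nowhere

# A caller who wants a case-insensitive search can normalise
# both sides explicitly before searching.
folded_hit = seq.upper().find("AcG".upper())
```

Keeping 'find' case sensitive leaves that normalisation choice
with the caller.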

On Sun, Mar 3, 2013 at 9:10 PM, Iddo Friedberg <idoerg at gmail.com> wrote:
> The thing is, I am a bit unsure of the utility of alphabets associated with
> a Seq object in general. (And I was the one who was one of the original
> crafters of the Seq object). It seems like *any* letter is acceptable -
> there is no strict alphabet checking. I inserted "Z"s into an
> unambiguous-dna Seq object. So I am not sure when this happened, but aren't
> alphabets supposed to provide some constraints?

Not so far, no, and I personally find this annoying. The
current alphabet system is quite heavy and not really
used to its full potential - checking the letters at __init__
time seems a good idea (when requested), likewise for
the MutableSeq object on edit.

Right now (kind of like duck-type-checking) Biopython looks
at the alphabet on demand, e.g. if trying to do a translation
or transcription. But for the most part, alphabets are ignored.

My idea on https://redmine.open-bio.org/issues/2597 is
to continue with the current relaxed approach UNLESS the
alphabet selected has a letters attribute which would be
treated as a white list of allowed letters. What I would like
in the long run is to typically use the existing generic DNA,
RNA, nucleotide, protein alphabets where all you care about
is the type. Where you do care about the exact letters used,
then the strict IUPAC alphabets would apply (or subclasses
for special cases).
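
A rough sketch of that white list idea, to make it concrete.
Everything here (GenericAlphabet, StrictDNAAlphabet, CheckedSeq)
is a hypothetical stand-in written for this email, not the actual
Bio.Alphabet or Bio.Seq code, and treating the check as case
sensitive is my assumption:

```python
class GenericAlphabet:
    """Hypothetical type-only alphabet: no white list of letters."""
    letters = None


class StrictDNAAlphabet(GenericAlphabet):
    """Hypothetical strict alphabet: only these letters are allowed."""
    letters = "GATC"


class CheckedSeq:
    """Toy stand-in for a Seq object (not Biopython's class)."""

    def __init__(self, data, alphabet):
        # Relaxed behaviour unless the alphabet defines `letters`,
        # in which case it is treated as a white list.
        allowed = getattr(alphabet, "letters", None)
        if allowed is not None:
            bad = sorted(set(data) - set(allowed))
            if bad:
                raise ValueError(
                    f"letters {bad} not in alphabet {allowed!r}"
                )
        self.data = data
        self.alphabet = alphabet


# A generic alphabet accepts anything, as happens today.
relaxed = CheckedSeq("GATTZZACA", GenericAlphabet())

# A strict alphabet rejects the stray "Z" letters at __init__ time.
try:
    CheckedSeq("GATTZZACA", StrictDNAAlphabet())
    strict_rejected = False
except ValueError:
    strict_rejected = True
```

The same check would need to run on each edit of a mutable
sequence, which this sketch does not cover.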

Perhaps we should finally do this in the next release
(or do a beta release with this enabled to see how many
complaints we get?).

Longer term, memory-efficient bit-encoded Seq classes
(like BioJava has) would be interesting, and would fit
nicely with the strict letter checking approach.
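
To illustrate why strict checking and bit encoding go together,
here is a toy 2-bit packing of unambiguous DNA. This is only a
sketch under my own assumptions (the A/C/G/T code assignment, the
function names pack and unpack), not how BioJava does it and not
a proposed Biopython API:

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit codes
LETTERS = "ACGT"  # index in this string == 2-bit code


def pack(seq):
    """Pack an unambiguous DNA string into bytes, 4 letters per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, letter in enumerate(seq):
        try:
            code = CODES[letter]
        except KeyError:
            # Only four letters fit in two bits, so anything else
            # has to be rejected: strict checking is forced on us.
            raise ValueError(f"cannot encode {letter!r}") from None
        out[i // 4] |= code << (2 * (i % 4))
    # The length is returned too, since the last byte may be
    # only partly used.
    return bytes(out), len(seq)


def unpack(packed, length):
    """Recover the original string from the packed bytes and length."""
    return "".join(
        LETTERS[(packed[i // 4] >> (2 * (i % 4))) & 3]
        for i in range(length)
    )
```

For example, pack("GATTACA") needs two bytes rather than seven,
and unpack on the result gives back the original string.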

Regards,

Peter