[Biopython-dev] Changing Seq equality

Wed Nov 25 12:53:14 UTC 2009

Hi all;
Interesting discussion on the equality issue.

> Dividing alphabets into those four groups would imply:
> 
> "ACG" == Seq("ACG") == Seq("ACG", generic_nucleotide)
> "ACG" != Seq("ACG", generic_rna)
> "ACG" != Seq("ACG", generic_dna)
> "ACG" != Seq("ACG", generic_protein)
> ...
> Seq("ACG") != Seq("ACG", generic_protein)
> 
> This has some non-intuitive behaviour. Also it doesn't take
> into account a number of corner cases (which could be better
> handled in the existing Seq objects I admit) - things like
> secondary structure alphabets (e.g. for proteins: coils, beta
> sheet, alpha helix) or reduced alphabets? (e.g. for proteins
> using Aliphatic/Aromatic/Charged/Tiny/Diverse, or any of
> the Murphy (2000) tables).

Instead of considering the most horrible edge cases, we should think
about the most common use cases and make those easy. Alphabets are a
bit overcomplicated and in practice are probably not being used to
represent these other potential alphabets. I may be simple-minded in
my programming, but I have never seen the benefit of directly encoding
anything more complicated than DNA, RNA or proteins. The 3 things
I've used alphabets for are:

- Is it DNA, RNA or protein?
- Does a sequence match the alphabet? Checking input files.
- Being careful not to add DNA and protein. In practice, I don't
  really do this very often.
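
As a rough illustration of the second use (checking input against an
alphabet), here is a minimal sketch. The letter sets and the
invalid_letters helper are made up for this example and are not
Biopython's alphabet classes:

```python
# Hypothetical letter sets for illustration only; these are not
# Biopython's alphabet objects.
DNA_LETTERS = frozenset("ACGT")
RNA_LETTERS = frozenset("ACGU")
PROTEIN_LETTERS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def invalid_letters(sequence, letters):
    """Return a sorted list of characters in sequence that are not in letters."""
    return sorted(set(sequence.upper()) - letters)


print(invalid_letters("ACGTTGCA", DNA_LETTERS))  # []
print(invalid_letters("ACGUXGCA", DNA_LETTERS))  # ['U', 'X']
```

An empty list means the input is consistent with the alphabet; anything
else points at the characters to look at in the input file.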

> We could consider a modified version of the string identity
> approach - make seq1==seq2 act as str(seq1)==str(seq2),
> but *also* look at the alphabets and if they are incompatible
> (using the existing rules used in addition etc) raise a Python
> warning. Right now this seems like quite a tempting idea to
> explore...

I like this combined with Jose's cases for the standard DNA, RNA,
protein and generic alphabets: provide sequence + alphabet checking
for all of the common cases, and a warning plus just sequence checking
for the edge cases. So if you try to compare a DNA sequence with a
sequence using your secondary structure alphabet, you will get a
mismatch on the sequences and a warning about incompatible alphabets.
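
To make the idea concrete, here is a small sketch of how that
comparison could behave. It is only one reading of the proposal, not
existing Biopython code: SimpleSeq, the string "kind" labels and
STANDARD_KINDS are all hypothetical stand-ins for Seq and its
alphabets.

```python
import warnings

# Hypothetical labels standing in for the common alphabet groups.
STANDARD_KINDS = {"generic", "dna", "rna", "protein"}


class SimpleSeq:
    """Toy stand-in for Seq: a string of letters plus an alphabet label."""

    def __init__(self, letters, kind="generic"):
        self.letters = letters
        self.kind = kind

    def __eq__(self, other):
        if not isinstance(other, SimpleSeq):
            return NotImplemented
        same_letters = self.letters == other.letters
        # Common cases: both the letters and the alphabet group must match.
        if self.kind in STANDARD_KINDS and other.kind in STANDARD_KINDS:
            return same_letters and self.kind == other.kind
        # Edge cases: warn if the alphabets differ, then compare letters only.
        if self.kind != other.kind:
            warnings.warn(
                f"Comparing incompatible alphabets: {self.kind!r} vs {other.kind!r}",
                stacklevel=2,
            )
        return same_letters


print(SimpleSeq("ACG", "dna") == SimpleSeq("ACG", "dna"))      # True
print(SimpleSeq("ACG", "dna") == SimpleSeq("ACG", "protein"))  # False

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = SimpleSeq("ACG", "dna") == SimpleSeq("HHE", "secondary_structure")
print(result, len(caught))  # False 1
```

Whether the generic group should count as compatible with DNA, RNA or
protein is left open here; in this sketch it is treated as its own
group.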

Brad