[Biopython-dev] Changing Seq equality

Jose Blanca jblanca at btc.upv.es
Wed Nov 25 08:45:20 UTC 2009


Hi,

> Without wanting to get too philosophical, an issue to consider in this, in
> addition to the technical problems outlined by Peter, is what do we *mean*
> when we ask about equality of two sequences?
>
> As Peter points out, there is something counterintuitive about the peptide
> "ACGT" somehow being equal to the nucleotide sequence "ACGT", and that is
> because we know that the things that these sequences represent are not in
> reality the same thing.
>
> Likewise, two instances of a repeat sequence in a genome are not
> necessarily the same conceptual item, even though they may have the same
> nucleotide sequence.  Also, two CDS from different sources may have the
> same conceptual translation, but the identical translations are arguably
> not the same sequence, and in both these circumstances a test for string
> equality ignores potentially significant between the physical/biological
> elements they describe.  These particular cases would give false positives
> for equality that could be 'gotchas' for the use in dictionaries that
> prompted this discussion.
>
> If we want to test for string equality of two sequences, we can already do
> that explicitly and simply with str(s1) == str(s2).  Making this the
> default behaviour for a string doesn't always conform to my own
> expectations of what 'equality' means for two sequences, because my
> expectation changes depending on the task in hand.
>
> An alternative reasonable test for equality might be whether the two
> sequences represent the same sequence, so Seq("M", generic_protein) ==
> Seq("ATG", generic_dna) might return True if we make some potentially dodgy
> assumptions about reading frames, and consider that they conceptually
> represent the same thing.  I think that it would a bad default behaviour,
> and harder to implement than testing string equality, but equally
> reasonable depending on what you think 'equality' means.
>
> Another, equally reasonable, definition of two sequences being 'equal' is
> that they share a locus tag or accession.  I test on this more frequently
> than I do on sequence identity, but still think it's a bad idea to make it
> a default test for sequence equality.
>
> Similarly, if two sequences (e.g. mRNA/cDNA) map to the same location on a
> genome, you might consider them equal.
>
> There are several equally reasonable and yet non-universal definitions of
> 'equality' for sequence comparisons, and we currently have the ability to
> test simply but explicitly for equality on the basis of any of these as we
> need to at the time.  I would prefer to see this requirement for an
> explicit string comparison kept, and the test for object equality kept as
> the default, because this never produces a false positive (and I value
> specificity over sensitivity as a default ;) ).

You're right, there's a lot of corner cases that I hadn't considered. I think 
of a Seq as an str with an alphabet so I wouldn't mind for some things, like 
the genome location of the Seq. But anyway, I use the __eq__ method as a 
convenience to avoid writting str(seq1) == str(seq2).
I'm aware that all abstractions leak, but that's not good or bad in itself. 
The abstraction is a model of the reality, as a model it won't be a perfect 
representation of the reality, just a convenient model. The abstraction is 
suited to a particular use, so its behaviour should be tailored to this use. 
I would implement this behaviour and document it's gotchas. Not implementing 
the behaviour because the abstraction leak also prevents the most general 
case that the abstraction is trying to cover.

> On 24/11/2009 11:30, "Peter" <biopython at maubp.freeserve.co.uk> wrote:
>
> [...]
>
> > The problem is if we'd like Seq("ACGT") to be equal to
> > Seq("ACGT", generic_dna) then both must have the
> > same hash. Then, if we also want Seq("ACGT") and
> > Seq("ACGT", generic_protein) to be equal, they too must
> > have the same hash. This means Seq("ACGT", generic_dna)
> > and Seq("ACGT",generic_protein) would have the same
> > hash, and therefore must evaluate as equal (!). The
> > natural consequence of this chain of logic is we would
> > then have Seq("ACGT") == Seq("ACGT", generic_dna)
> > == Seq("ACGT",generic_protein) == Seq("ACGT",...).
> > You reach the same point if we require the string
> > "ACGT" equals Seq("ACGT", some_alphabet)

Oh! I didn't know that! It's great to learn new python things!
I'm being naive here because I just have a swallow understanding of the 
problem, but here are my two cents.
would it be possible to generate the hashes and the __eq__ taking into account 
the base alphabet. For instance DNAAlphabet=0, RNAAlphabet=1 and 
ProteinAlphabet=2. So to check if two sequences we would do something like:
'ACGT1' == 'ACGT2'

Regards,

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)



More information about the Biopython-dev mailing list