[Biopython-dev] Changing Seq equality

Leighton Pritchard lpritc at scri.ac.uk
Tue Nov 24 16:26:34 UTC 2009


Hi,

Without wanting to get too philosophical, an issue to consider in this, in
addition to the technical problems outlined by Peter, is what do we *mean*
when we ask about equality of two sequences?

As Peter points out, there is something counterintuitive about the peptide
"ACGT" somehow being equal to the nucleotide sequence "ACGT", and that is
because we know that the things that these sequences represent are not in
reality the same thing.

Likewise, two instances of a repeat sequence in a genome are not necessarily
the same conceptual item, even though they may have the same nucleotide
sequence.  Also, two CDS from different sources may have the same conceptual
translation, but the identical translations are arguably not the same
sequence.  In both these circumstances a test for string equality ignores
potentially significant differences between the physical/biological elements
they describe.  These particular cases would give false positives for
equality that could be 'gotchas' for the use in dictionaries that prompted
this discussion.

If we want to test for string equality of two sequences, we can already do
that explicitly and simply with str(s1) == str(s2).  Making this the default
behaviour for a Seq doesn't always conform to my own expectations of what
'equality' means for two sequences, because my expectation changes depending
on the task in hand.
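
As a rough illustration of the difference between the two tests, here is a
minimal sketch using a toy class rather than the real Bio.Seq.Seq.  The name
ToySeq and the plain-string alphabet labels are assumptions made purely for
this example; the class defines no __eq__ or __hash__, so Python falls back
to comparing object identity.

```python
class ToySeq:
    """Hypothetical stand-in for a Seq: sequence letters plus an alphabet label."""

    def __init__(self, data, alphabet=None):
        self.data = data
        self.alphabet = alphabet

    def __str__(self):
        return self.data

    # No __eq__ or __hash__ here, so == and hashing use object identity.


s1 = ToySeq("ACGT", "dna")
s2 = ToySeq("ACGT", "protein")

# Default comparison: two distinct objects are never equal.
identity_equal = (s1 == s2)

# Explicit string comparison: the letters match even though the molecules differ.
string_equal = (str(s1) == str(s2))

# As dictionary keys, the objects stay separate; keying on str() merges them.
by_object = {s1: "dna record", s2: "protein record"}
by_string = {str(s1): "dna record", str(s2): "protein record"}

print(identity_equal, string_equal, len(by_object), len(by_string))
```

Keying the dictionary on str() is where the false positive shows up: the
second assignment silently overwrites the first.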

An alternative reasonable test for equality might be whether the two
sequences represent the same sequence, so Seq("M", generic_protein) ==
Seq("ATG", generic_dna) might return True if we make some potentially dodgy
assumptions about reading frames, and consider that they conceptually
represent the same thing.  I think that it would be a bad default behaviour,
and harder to implement than testing string equality, but equally reasonable
depending on what you think 'equality' means.

Another, equally reasonable, definition of two sequences being 'equal' is
that they share a locus tag or accession.  I test on this more frequently
than I do on sequence identity, but still think it's a bad idea to make it a
default test for sequence equality.

Similarly, if two sequences (e.g. mRNA/cDNA) map to the same location on a
genome, you might consider them equal.
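
To make the point concrete, each of these notions of 'equality' can be
written as its own small, explicitly named test.  The sketch below is only
illustrative: the Record type, its field names, and the example values are
all hypothetical and are not part of Biopython.

```python
from collections import namedtuple

# Hypothetical record type; field names and values are illustrative only.
Record = namedtuple("Record", ["accession", "sequence", "location"])


def same_string(a, b):
    """Equal in the sense of having identical sequence letters."""
    return a.sequence == b.sequence


def same_accession(a, b):
    """Equal in the sense of sharing an accession or locus tag."""
    return a.accession == b.accession


def same_location(a, b):
    """Equal in the sense of mapping to the same genomic location."""
    return a.location == b.location


# Two copies of a repeat: same letters, different places in the genome.
repeat_1 = Record("REP_0001", "ACGTACGT", ("chr1", 100, 108))
repeat_2 = Record("REP_0002", "ACGTACGT", ("chr1", 5000, 5008))

print(same_string(repeat_1, repeat_2))     # True
print(same_accession(repeat_1, repeat_2))  # False
print(same_location(repeat_1, repeat_2))   # False
```

Which of the three answers is the 'right' one depends on the task, which is
why none of them is an obvious default.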

There are several equally reasonable and yet non-universal definitions of
'equality' for sequence comparisons, and we currently have the ability to
test simply but explicitly for equality on the basis of any of these as we
need to at the time.  I would prefer to see this requirement for an explicit
string comparison kept, and the test for object equality kept as the
default, because this never produces a false positive (and I value
specificity over sensitivity as a default ;) ).

Cheers,

L.


On 24/11/2009 11:30, "Peter" <biopython at maubp.freeserve.co.uk> wrote:

[...]
 
> The problem is if we'd like Seq("ACGT") to be equal to
> Seq("ACGT", generic_dna) then both must have the
> same hash. Then, if we also want Seq("ACGT") and
> Seq("ACGT", generic_protein) to be equal, they too must
> have the same hash. This means Seq("ACGT", generic_dna)
> and Seq("ACGT",generic_protein) would have the same
> hash, and therefore must evaluate as equal (!). The
> natural consequence of this chain of logic is we would
> then have Seq("ACGT") == Seq("ACGT", generic_dna)
> == Seq("ACGT",generic_protein) == Seq("ACGT",...).
> You reach the same point if we require the string
> "ACGT" equals Seq("ACGT", some_alphabet)
> 
> i.e. Another option would be to base Seq equality
> and hashing on the sequence string only (ignoring
> the alphabet).
> 
> This would at least be a simple rule to remember (and
> would mean we could implement less than, greater than
> etc in the same way) but basically means we'd ignore
> the alphabet.

[...]

> Changing Seq equality like this would make Biopython
> much nicer to use for basic tasks. For example, my
> code (and the unit tests) often contains things like if
> str(seq1)==str(seq2).
> 
> If we want to make this change, it is quite a break to
> backwards compatibility. (It also has the downside that
> a DNA sequence ACGT and a protein sequence ACGT
> would evaluate as equal - probably not a big issue in
> practice but counter intuitive).
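
The chain of reasoning about equality and hashing quoted above can be
sketched with a toy class.  WildcardSeq is a hypothetical name, not a
Biopython class: it treats a missing alphabet as compatible with anything,
which is the behaviour Peter describes as tempting.  Because equal objects
must share a hash, the hash can only depend on the string, and the resulting
equality is not transitive.

```python
class WildcardSeq:
    """Hypothetical Seq-like class: alphabet None matches any alphabet."""

    def __init__(self, data, alphabet=None):
        self.data = data
        self.alphabet = alphabet

    def __eq__(self, other):
        if not isinstance(other, WildcardSeq):
            return NotImplemented
        if self.data != other.data:
            return False
        return (
            self.alphabet is None
            or other.alphabet is None
            or self.alphabet == other.alphabet
        )

    def __hash__(self):
        # Equal objects must hash equally, so only the string can be used.
        return hash(self.data)


def distinct_count(items):
    """Add items to a set one at a time and report how many were kept."""
    seen = set()
    for item in items:
        seen.add(item)
    return len(seen)


generic = WildcardSeq("ACGT")
dna = WildcardSeq("ACGT", "dna")
protein = WildcardSeq("ACGT", "protein")

print(generic == dna, generic == protein, dna == protein)  # True True False
print(hash(dna) == hash(protein))                          # True

# The number of 'distinct' sequences now depends on insertion order.
print(distinct_count([dna, protein, generic]))  # 2
print(distinct_count([generic, dna, protein]))  # 1
```

The order-dependent set size is the practical symptom of the problem; the
only consistent fixes are to ignore the alphabet entirely or to keep the
alphabet-less Seq unequal to the others.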



-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405




