[Biojava-l] equality of proteins based on their aminoacid sequence signature
Andy Yates
ayates at ebi.ac.uk
Fri Mar 11 22:48:34 UTC 2011
Hi Francois,
So I've been thinking about this & if we add equality to a small set of objects (compounds & compound sets) we can get sequence equality working. This will be done as part of the SequenceMixin class & we can do case-sensitive & case-insensitive versions. We can also use the length and the compound sets to reject a pair of sequences without needing to iterate through them. The code will look like:
SequenceMixin.sequenceEquality(dnaOne, dnaTwo);
or
SequenceMixin.sequenceEqualityIgnoreCase(dnaOne, dnaTwo);
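To illustrate the idea, here is a rough sketch of how such a comparison could work. It is not the BioJava implementation: the class name SequenceEqualitySketch is hypothetical, and sequences are modelled as plain lists of compound short names so the example runs without BioJava. It only shows the length-based rejection; a compound set check would sit alongside it.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the proposed SequenceMixin methods. Sequences are
// modelled as lists of compound short names, not BioJava Sequence objects.
final class SequenceEqualitySketch {

    static boolean sequenceEquality(List<String> a, List<String> b) {
        return compare(a, b, false);
    }

    static boolean sequenceEqualityIgnoreCase(List<String> a, List<String> b) {
        return compare(a, b, true);
    }

    private static boolean compare(List<String> a, List<String> b, boolean ignoreCase) {
        // Cheap rejection: sequences of different length can never be equal,
        // so there is no need to walk through the compounds.
        if (a.size() != b.size()) {
            return false;
        }
        Iterator<String> itA = a.iterator();
        Iterator<String> itB = b.iterator();
        while (itA.hasNext()) {
            String x = itA.next();
            String y = itB.next();
            boolean same = ignoreCase ? x.equalsIgnoreCase(y) : x.equals(y);
            if (!same) {
                return false; // stop at the first mismatch
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> dnaOne = Arrays.asList("A", "C", "G", "T");
        List<String> dnaTwo = Arrays.asList("a", "c", "g", "t");
        System.out.println(sequenceEquality(dnaOne, dnaTwo));           // false
        System.out.println(sequenceEqualityIgnoreCase(dnaOne, dnaTwo)); // true
    }
}
```

The length check costs nothing compared with the loop, which is why it comes first.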
Don't forget you can also use checksums like MD5 & SHA-1 to calculate a value which should be very unlikely to clash (projects like InterPro use this technique to cache results behind a very quick lookup). You can do this like:
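// Needs java.security.MessageDigest, java.math.BigInteger and
// java.nio.charset.StandardCharsets; getInstance() throws NoSuchAlgorithmException.
MessageDigest m = MessageDigest.getInstance("MD5");
for (Compound c : seq) {
    // Feed each compound's short name into the digest, with an explicit charset
    m.update(c.getShortName().getBytes(StandardCharsets.UTF_8));
}
// Read the digest as an unsigned integer and format it as 32 zero-padded hex digits
BigInteger i = new BigInteger(1, m.digest());
String md5checksum = String.format("%1$032X", i);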
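For anyone who wants to try the checksum approach without BioJava on the classpath, here is a self-contained sketch under some assumptions: the class name SequenceChecksumSketch is hypothetical, and a sequence is again modelled as a list of compound short names. Only standard JDK calls are used.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; a sequence is a list of compound short names here.
final class SequenceChecksumSketch {

    static String md5(List<String> compounds) throws NoSuchAlgorithmException {
        MessageDigest m = MessageDigest.getInstance("MD5");
        for (String shortName : compounds) {
            // The digest is case-sensitive: "a" and "A" give different checksums.
            m.update(shortName.getBytes(StandardCharsets.UTF_8));
        }
        // 16-byte digest rendered as 32 upper-case hex digits, zero-padded
        return String.format("%032X", new BigInteger(1, m.digest()));
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        List<String> dnaOne = Arrays.asList("A", "C", "G", "T");
        List<String> dnaTwo = Arrays.asList("A", "C", "G", "T");
        // Equal sequences always produce equal checksums.
        System.out.println(md5(dnaOne).equals(md5(dnaTwo))); // true
    }
}
```

Matching checksums make equality very likely but do not prove it, so a full comparison is still the safe choice when a wrong answer matters. For a case-insensitive checksum, one option is to upper-case each short name before updating the digest.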
HTH
Andy
On 10 Mar 2011, at 12:47, Andy Yates wrote:
> This is where the subject becomes murky & will probably mean that any code written for equals() & hashCode() will have to take features into account where present. However, Sequence compound identity would still be available from another method, but this will require an extension of the Sequence interface.
>
> Andy
>
> On 10 Mar 2011, at 12:22, Francois Le Fevre wrote:
>
>> This could be great. But for me equals means only sequence identity and not features.
>>
>>
>>> On 10 March 2011 10:17, "Andy Yates" <ayates at ebi.ac.uk> wrote:
>>>
>>> I cannot remember the reason why we decided to not include equality for these objects. It's not an unreasonable thing to want though. Assuming I have some time soon I can have a look into implementing it on AbstractCompound, AbstractSequence & the backing stores but it will be some time away. If anyone else wants to give it a shot ... :)
>>>
>>> Andy
>>>
>>> On 10 Mar 2011, at 01:04, Andreas Prlic wrote:
>>>
>>>> Hi François,
>>>>
>>>> you could try to compare the st...
>>>
>>> --
>>> Andrew Yates Ensembl Genomes Engineer
>>> EMBL-EBI Tel: +44-(0)1...
>>>
>>
>
> --
> Andrew Yates Ensembl Genomes Engineer
> EMBL-EBI Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l