[Biojava-l] issue with translating codons with N

Fri Sep 20 15:53:37 UTC 2013

Hi Nick,

It depends on the number of translations you're doing. We did some fairly large-scale performance testing with Pjotr Prins for his chapter in the Evolutionary Genomics book, and those changes were a direct result of that *. I think the right solution is to revert to using the map already available in the class.

Andy

* I wish I still had the stats from before & after this change to show what the performance impact was.
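
For illustration, here is a minimal sketch of what a map-based lookup could look like. The class name MapCodonTable and its methods are hypothetical (this is not the existing BioJava code), and the table is assumed to be the NCBI standard genetic code rather than whatever table the engine is configured with. Any codon that is not in the map, such as one containing N, falls through to X.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a map-backed codon table; not BioJava's actual class. */
final class MapCodonTable {
    // NCBI standard genetic code (translation table 1), bases ordered U, C, A, G.
    private static final String AMINO_ACIDS =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
    private static final char[] BASES = {'U', 'C', 'A', 'G'};
    private static final Map<String, Character> TABLE = new HashMap<>();

    static {
        // Enumerate UUU, UUC, UUA, UUG, UCU, ... in the same order as AMINO_ACIDS.
        int i = 0;
        for (char first : BASES) {
            for (char second : BASES) {
                for (char third : BASES) {
                    TABLE.put("" + first + second + third, AMINO_ACIDS.charAt(i++));
                }
            }
        }
    }

    /** Translates codon by codon; DNA input is accepted by treating T as U. */
    static String translate(String sequence) {
        String rna = sequence.toUpperCase().replace('T', 'U');
        StringBuilder protein = new StringBuilder();
        for (int p = 0; p + 3 <= rna.length(); p += 3) {
            // Codons with N or any other ambiguous base are not keys, so they become X.
            protein.append(TABLE.getOrDefault(rna.substring(p, p + 3), 'X'));
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("ANAANG"));       // prints XX
        System.out.println(translate("GTNTGTTAGTGT")); // prints XC*C
    }
}
```

Because the key is the codon string itself, two different codons can never share an entry, whatever the performance cost turns out to be.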

On 20 Sep 2013, at 15:57, Nick England <nickengland at gmail.com> wrote:

> Andy and Andy,
> 
> Does the array really give that much of a performance improvement over a
> HashMap? Rather than a huge empty array you could just use a char[][][] for
> the three codon positions if you want to do it that way. I think the problem
> is that the code for translating "N" as X isn't called if the array lookup
> returns a match. And the array lookup doesn't seem calibrated to cope with Ns
> (or indeed any ambiguous bases).
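
A rough sketch of the char[][][] idea, under stated assumptions: DenseCodonTable, baseIndex and translateCodon are made-up names, the table is the NCBI standard genetic code, and the ambiguity check runs before the array lookup so that N can never land on a real entry.

```java
/** Hypothetical sketch of a dense 4x4x4 codon table; not BioJava's actual code. */
final class DenseCodonTable {
    // NCBI standard genetic code (translation table 1), bases ordered U, C, A, G.
    private static final String AMINO_ACIDS =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
    private static final char[][][] TABLE = new char[4][4][4];

    static {
        // Position i in the string corresponds to bases (i / 16, (i / 4) % 4, i % 4).
        for (int i = 0; i < 64; i++) {
            TABLE[i / 16][(i / 4) % 4][i % 4] = AMINO_ACIDS.charAt(i);
        }
    }

    /** Maps an unambiguous base to 0..3 and anything else (N, R, Y, ...) to -1. */
    static int baseIndex(char base) {
        switch (Character.toUpperCase(base)) {
            case 'U': case 'T': return 0;
            case 'C': return 1;
            case 'A': return 2;
            case 'G': return 3;
            default: return -1;
        }
    }

    /** Looks up one codon; any ambiguous base short-circuits to X before the array is touched. */
    static char translateCodon(char b1, char b2, char b3) {
        int i = baseIndex(b1);
        int j = baseIndex(b2);
        int k = baseIndex(b3);
        if (i < 0 || j < 0 || k < 0) {
            return 'X';
        }
        return TABLE[i][j][k];
    }

    static String translate(String sequence) {
        StringBuilder protein = new StringBuilder();
        for (int p = 0; p + 3 <= sequence.length(); p += 3) {
            protein.append(translateCodon(
                sequence.charAt(p), sequence.charAt(p + 1), sequence.charAt(p + 2)));
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("ANAANG"));       // prints XX
        System.out.println(translate("GTNTGTTAGTGT")); // prints XC*C
    }
}
```

This is deliberately conservative: a codon such as GCN, where all four completions give the same amino acid, also comes out as X here.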
> 
> And Andy Law, yes, it's using the int values (but it used to use the numbers
> 1,2,3,4 from the looks of it.)
> 
>       public int compoundToInt(NucleotideCompound c) {
>            char b = c.getUpperedBase().charAt(0);
>            return (int)b;
> //            int v = -1;
> //            if('A' == b) {
> //                v = 1;
> //            }
> //            else if('C' == b) {
> //                v = 2;
> //            }
> //            else if('G' == b) {
> //                v = 3;
> //            }
> //            else if('T' == b || 'U' == b) {
> //                v = 4;
> //            }
> //            return v;
> }
> 
> 
> On 20 September 2013 15:51, LAW Andy <andy.law at roslin.ed.ac.uk> wrote:
> 
>> Looking at the multipliers, I would hazard a guess that the *intent* is to
>> multiply the numbers 0,1,2,3 (ACGT) rather than the ASCII codes. Are you
>> sure the code uses ASCII values?
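
To make that guess concrete, here is a small sketch assuming the codes A=0, C=1, G=2, T/U=3. With those codes the 16/4/1 multipliers are just a base-4 positional number, so the 64 unambiguous codons map one-to-one onto 0..63. The class and method names are hypothetical.

```java
/** Hypothetical sketch: the 16/4/1 multipliers applied to codes 0..3 instead of ASCII values. */
final class Base4CodonIndex {
    private static final char[] BASES = {'A', 'C', 'G', 'U'};

    /** A=0, C=1, G=2, T/U=3; every other symbol is reported as -1. */
    static int code(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': case 'U': return 3;
            default: return -1;
        }
    }

    /** Returns an index in 0..63, or -1 if the codon contains an ambiguous base. */
    static int index(String codon) {
        int a = code(codon.charAt(0));
        int b = code(codon.charAt(1));
        int c = code(codon.charAt(2));
        if (a < 0 || b < 0 || c < 0) {
            return -1;
        }
        return 16 * a + 4 * b + c;
    }

    /** Checks that all 64 unambiguous codons receive distinct indices in 0..63. */
    static boolean allDistinct() {
        boolean[] seen = new boolean[64];
        for (char first : BASES) {
            for (char second : BASES) {
                for (char third : BASES) {
                    int idx = index("" + first + second + third);
                    if (idx < 0 || idx >= 64 || seen[idx]) {
                        return false;
                    }
                    seen[idx] = true;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allDistinct()); // prints true
        System.out.println(index("ANA"));  // prints -1
    }
}
```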
>> 
>> 
>> On 20 Sep 2013, at 15:16, Nick England <nickengland at gmail.com> wrote:
>> 
>>> Everyone,
>>> 
>>> I've stepped through with a debugger, and this is a bad bug.
>>> 
>>> The code to translate from RNA->Protein does the following:
>>> - Take the ASCII value for each of the 3 RNA bases, multiply the first pos by
>>> 16, the second by 4 and the third by 1, and add them up.
>>> - Assume there won't be any collisions.
>>> 
>>> Here are the values which it then uses:
>>> 
>>> A:65
>>> G:71
>>> C:67
>>> U:85
>>> N:78
>>> ANA: 1417
>>> CAU: 1417
>>> ANG: 1423
>>> CGC: 1423
>>> 
>>> Notice any hash collisions?
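
The arithmetic can be checked with a short standalone sketch. It assumes only the key formula described above (16 x first + 4 x second + third, on the ASCII codes); AsciiKeyCollisions and its methods are made-up names and nothing here is taken from the BioJava source. It also searches the 64 unambiguous codons for any that share a key with a given codon.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch reproducing the ASCII-based key; not BioJava's actual code. */
final class AsciiKeyCollisions {
    private static final char[] BASES = {'A', 'C', 'G', 'U'};

    /** 16 * first + 4 * second + third, using the raw character codes. */
    static int key(String codon) {
        return 16 * codon.charAt(0) + 4 * codon.charAt(1) + codon.charAt(2);
    }

    /** Lists every unambiguous RNA codon, other than the input, with the same key. */
    static List<String> collisions(String codon) {
        List<String> hits = new ArrayList<>();
        for (char first : BASES) {
            for (char second : BASES) {
                for (char third : BASES) {
                    String candidate = "" + first + second + third;
                    if (!candidate.equals(codon) && key(candidate) == key(codon)) {
                        hits.add(candidate);
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(key("ANA"));        // prints 1417
        System.out.println(collisions("ANA")); // prints [CAU]
    }
}
```

CAU codes for histidine, which is where the H in the HR output comes from.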
>>> 
>>> I don't get why this wasn't done in a standard Java HashMap, which would
>>> ensure that any collisions were resolved. This is a pretty critical bug for
>>> a biology informatics package.
>>> 
>>> Nick
>>> 
>>> 
>>> On 20 September 2013 13:45, Nick England <nickengland at gmail.com> wrote:
>>> 
>>>> Hara,
>>>> 
>>>> Hmm this is rather odd. I get the same issue with that sequence with a
>>>> custom engine as well.
>>>> 
>>>> My code has:
>>>>   Builder builder = new TranscriptionEngine.Builder();
>>>>   builder.initMet(false);
>>>>   builder.translateNCodons(true);
>>>>   builder.trimStop(false);
>>>>   TranscriptionEngine engine = builder.build();
>>>>   Sequence<AminoAcidCompound> seq = engine.translate(new DNASequence("GTNTGTTAGTGT"));
>>>>   assertEquals("XC*C", seq.toString());
>>>>   Sequence<AminoAcidCompound> seq2 = engine.translate(new DNASequence("ANAANG"));
>>>>   System.out.println(seq2);
>>>> The first sequence translates as expected, but your sequence is
>>>> translating as HR, when it should be XX. This looks like a pretty bad bug!
>>>> 
>>>> Nick
>>>> 
>>>> 
>>>> On 19 September 2013 19:59, Hara Dilley <hdilley at sutrobio.com> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> Is there an issue with the DNA Translation in biojava3.core?
>>>>> It appears that it wants to translate "N" in certain cases.
>>>>> Executing:
>>>>> new DNASequence("ANAANG").getRNASequence().getProteinSequence().getSequenceAsString();
>>>>> will produce the amino acids HR.
>>>>> 
>>>>> thanks
>>>>> Hara
>>>>> 
>>>>> _______________________________________________
>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>> 
>>>> 
>>>> 
>> 
>> Later,
>> 
>> Andy