[BioRuby] Translate ambiguous sequence

Mon Sep 15 23:15:19 EDT 2008

Hi,

Thank you for comments.
 > (but I prefer to set check_start = true by default;
It was set to false for the default for just not to
change the default behavior and is ok to make true for me.
If the change of the interface is allowed,
I prefer that the unknown be later option, since
changing the unknown from 'X' is expected to be very rare,
and, in fact, it can be done just a gsub operation without
the help of the library.

> As for the ambiguity, your needs seems to be restricted
> only for the 3' end of the sequence, but there may be demands
> for translating 'n's in the sequence.

My need is not restricted to the 3' end, and also not restricted to
'N's but there are ten other IUPAC redundant codes.
The message on September 11 treated only on these situations
(where whole triplet is given but contain an ambiguity code)
but not conscious on the start and the 3' end translation of 2 base.

I agree that addition of all possible redundant determinate codes to  
the codon tables
is another way to resolve the ambiguity codes.
But the table will be quite large to support all the possible
combinations for all the tables (at least for human review),
and a generator should be written.
Expecting that sequences containing ambiguity is rare, I wrote the  
code that will
not impact the efficiency of translating sequence without ambiguity.
Apparently the code for ambiguity is quite expensive, but I do not  
expect translating
sequences containing so many ambiguity code that is problematic.
(High proportion of ambiguity in itself is ok if the sequence is not  
very long).
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

On 2008/09/15, at 21:12, Toshiaki Katayama wrote:

> Hi,
>
> * check_start
>
> As you suggested, the codon table object (Bio::CodonTable) holds a  
> list of
> start codons as a knowledge, but Bio::Sequence::NA#translate method  
> does not
> utilize it (it is also true for the stop codons).
>
> lib/bio/data/codontable.rb:
> ------------------------------------------------------------
>   # Create your own codon table by giving a Hash table of codons  
> and relevant
>   # amino acids.  You can also able to define the table's name as a  
> second
>   # argument.
>   #
>   # Two Arrays 'start' and 'stop' can be specified which contains a  
> list of
>   # start and stop codons used by 'start_codon?' and 'stop_codon?'  
> methods.
>   def initialize(hash, definition = nil, start = [], stop = [])
>     @table = hash
>     @definition = definition
>     @start = start
>     @stop = stop.empty? ? generate_stop : stop
>   end
> ------------------------------------------------------------
>
> So, the following your code should be included in someway
> (but I prefer to set check_start = true by default; and
> use 'first_codon' variable explicitly instead of naseq[0, 3]).
>
> ------------------------------------------------------------
> +    if check_start and from == 0 and ct.start_codon?(naseq[0, 3])
> ------------------------------------------------------------
>
>
> * ambiguity
>
> As for the ambiguity, your needs seems to be restricted
> only for the 3' end of the sequence, but there may be demands
> for translating 'n's in the sequence.
>
> As the Bio::Sequence::NA#translate accepts the codon table object
> of your own as the 2nd argument, and you can copy and override
> the default codon tables (#1 to #23; or you can define your own
> codon table from scratch), there may be another approach to define
> ambiguous translations by your own.
>
> ------------------------------------------------------------
> your_codon_table = Bio::CodonTable.copy(1)
> your_codon_table['cgn'] = 'R'
> your_codon_table['cg'] = 'R'
>
> aaseq = naseq.translate(frame, your_codon_table)
> ------------------------------------------------------------
>
> To do this, we only need to change the following lines
>
> lib/bio/sequence/na.rb (translate):
> ------------------------------------------------------------
>    nalen -= nalen % 3
>    aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or  
> unknown}
> ------------------------------------------------------------
>
> to the below
>
> ------------------------------------------------------------
>    #nalen -= nalen % 3
>    aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct[codon] or  
> unknown}
> ------------------------------------------------------------
>
> but may be with a toggle flag to enable/disable this feature.
>
> Regards,
> Toshiaki Katayama
>
>
>
> On 2008/09/15, at 19:08, Tomoaki NISHIYAMA wrote:
>
>> Hi,
>>
>> To further make translation compatible what is done between DNA  
>> entry and protein
>> entry in databases, I thought that special treatment of the start  
>> codon and
>> incomplete codons are necessary.
>>
>> Special treatment of the start codons are for those codons that is
>> translated to M only when it is used as the start codon and
>> a different amino acids if it is used as an internal codon within  
>> a CDS.
>> For example GUG is V if it is internal to the CDS, but it can also  
>> serve
>> as a start codon and in that case it encodes M.
>> To change the behavior, I think an option is required.
>>
>> Incomplete codons are seen at the end of incomplete CDS,  
>> presumably due to
>> cloning or sequencing strategy.
>> When there are 'cg' at the end of CDS that are translated to 'R'
>> as any nucleotide would make the codon translate as 'R'
>>
>> It seems the translation are added only if the amino acid can be  
>> specified and is not 'X'.
>> -- 
>> Tomoaki NISHIYAMA
>>
>> Advanced Science Research Center,
>> Kanazawa University,
>> 13-1 Takara-machi,
>> Kanazawa, 920-0934, Japan
>>
>> diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/ 
>> lib/bio/data/codontable.rb bioruby-a/lib/bio/data/codontable.rb
>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ 
>> bio/data/codontable.rb 2008-09-03 22:24:39.000000000 +0900
>> +++ bioruby-a/lib/bio/data/codontable.rb        2008-09-13  
>> 12:06:28.000000000 +0900
>> @@ -93,6 +93,23 @@
>>   def [](codon)
>>     @table[codon]
>>   end
>> +  def translate_ambiguity(codon, unknown = 'X')
>> +    triplet = codon + "NNN"
>> +    aa = nil
>> +    Bio::NucleicAcid.ambiguity2individual(triplet[2..2]).each do| 
>> third|
>> +      Bio::NucleicAcid.ambiguity2individual(triplet[0..0]).each  
>> do|first|
>> +        Bio::NucleicAcid.ambiguity2individual(triplet[1..1]).each  
>> do|second|
>> +          if aa == nil
>> +            aa = @table[first+second+third]
>> +          elsif
>> +            aa != @table[first+second+third]
>> +            return unknown
>> +          end
>> +        end
>> +      end
>> +    end
>> +    aa
>> +  end
>>
>>   # Modify the codon table.  Use with caution as it may break hard  
>> coded
>>   # tables.  If you want to modify existing table, you should use  
>> copy
>> diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/ 
>> lib/bio/data/na.rb bioruby-a/lib/bio/data/na.rb
>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ 
>> bio/data/na.rb 2008-09-03 22:24:39.000000000 +0900
>> +++ bioruby-a/lib/bio/data/na.rb        2008-09-13  
>> 12:06:28.000000000 +0900
>> @@ -182,6 +182,13 @@
>>       end
>>       Regexp.new(str)
>>     end
>> +    def ambiguity2individual(na, rna = false)
>> +      str = NAMES[na.downcase].gsub(/[\[\]]/,"")
>> +      if rna
>> +        str.tr!("t", "u")
>> +      end
>> +      str.split(//)
>> +    end
>>
>>   end
>>
>> diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/ 
>> lib/bio/sequence/na.rb bioruby-a/lib/bio/sequence/na.rb
>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ 
>> bio/sequence/na.rb     2008-09-03 22:24:39.000000000 +0900
>> +++ bioruby-a/lib/bio/sequence/na.rb    2008-09-15  
>> 18:57:19.000000000 +0900
>> @@ -231,7 +231,7 @@
>>   #   (default 1)
>>   # * (optional) _unknown_: Character (default 'X')
>>   # *Returns*:: Bio::Sequence::AA object
>> -  def translate(frame = 1, table = 1, unknown = 'X')
>> +  def translate(frame = 1, table = 1, unknown = 'X', check_start  
>> = false)
>>     if table.is_a?(Bio::CodonTable)
>>       ct = table
>>     else
>> @@ -251,8 +251,19 @@
>>       from = 0
>>     end
>>     nalen = naseq.length - from
>> -    nalen -= nalen % 3
>> -    aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or  
>> unknown}
>> +#    nalen -= nalen % 3
>> +    if check_start and from == 0 and ct.start_codon?(naseq[0, 3])
>> +      if nalen > 3
>> +        aaseq = "M" + naseq[from+3, nalen-3].gsub(/.{1,3}/) {| 
>> codon| ct[codon] or ct.translate_ambiguity(codon, unknown)}
>> +      else
>> +        aaseq = "M"
>> +      end
>> +    else
>> +      aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct 
>> [codon] or ct.translate_ambiguity(codon, unknown)}
>> +    end
>> +    if nalen % 3 != 0
>> +      aaseq.sub!(/X$/,"")
>> +    end
>>     return Bio::Sequence::AA.new(aaseq)
>>   end
>>
>>
>> _______________________________________________
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby
>
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>