[BioRuby] Translate ambiguous sequence

Mon Sep 15 10:08:56 UTC 2008

Hi,

To make translation more consistent with how protein entries are
derived from DNA entries in the databases, I think special treatment
of start codons and incomplete codons is necessary.

Special treatment is needed for start codons that are translated to M
only when used as the start codon, and to a different amino acid when
used as an internal codon within a CDS.
For example, GUG is V if it is internal to the CDS, but it can also serve
as a start codon, and in that case it encodes M.
I think an option is required to switch this behavior on
(check_start in the patch below, off by default).
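
To illustrate the start codon rule, here is a minimal standalone sketch.
It does not use BioRuby: the tiny codon hash, the list of start codons,
and the method name translate_cds are all assumptions made up for this
example. Only the first codon is affected, and only when the option is on.

```ruby
# Partial codon table for illustration only (not the full genetic code).
DEMO_CODONS = { 'aug' => 'M', 'gug' => 'V', 'gcu' => 'A' }.freeze
# Assumed start codons for this sketch; real tables differ by organism.
DEMO_STARTS = %w[aug gug].freeze

# Translate complete codons; with check_start, a recognized start codon
# in the first position is emitted as 'M' regardless of the table.
def translate_cds(rna, check_start = false)
  rna.downcase.scan(/.{3}/).each_with_index.map do |codon, i|
    if check_start && i.zero? && DEMO_STARTS.include?(codon)
      'M'
    else
      DEMO_CODONS.fetch(codon, 'X')
    end
  end.join
end

puts translate_cds('guggcugug')        # => VAV
puts translate_cds('guggcugug', true)  # => MAV (internal gug stays V)
```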

Incomplete codons are seen at the end of an incomplete CDS, presumably
due to the cloning or sequencing strategy.
When 'cg' remains at the end of a CDS, it is translated to 'R',
because any nucleotide in the third position makes the codon encode 'R'.

It seems that the trailing amino acid is added only when it can be
determined unambiguously, that is, when it is not 'X'.
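
The rule for a trailing incomplete codon can be sketched as follows.
This is again standalone and not the patch itself: the partial table
and the names resolve_partial and translate_with_tail are hypothetical.
The idea is to try every possible completion of the partial codon and
keep the amino acid only if all completions agree.

```ruby
# Partial table for illustration: the four cg? codons and the four ga? codons.
TAIL_TABLE = {
  'cga' => 'R', 'cgc' => 'R', 'cgg' => 'R', 'cgu' => 'R',
  'gaa' => 'E', 'gag' => 'E', 'gac' => 'D', 'gau' => 'D'
}.freeze
TAIL_BASES = %w[a c g u].freeze

# Return the amino acid if every completion of the partial codon
# encodes the same one, otherwise nil.
def resolve_partial(partial)
  tails = ['']
  (3 - partial.length).times { tails = tails.product(TAIL_BASES).map(&:join) }
  aas = tails.map { |tail| TAIL_TABLE[partial + tail] }.uniq
  aas.length == 1 ? aas.first : nil
end

# Translate full codons normally; add the trailing residue only when
# the incomplete codon is unambiguous.
def translate_with_tail(rna)
  rna.downcase.scan(/.{1,3}/).map do |codon|
    codon.length == 3 ? TAIL_TABLE.fetch(codon, 'X') : resolve_partial(codon).to_s
  end.join
end

puts translate_with_tail('gaacg')  # => ER (cg? is always R)
puts translate_with_tail('gaaga')  # => E  (ga? could be D or E, so dropped)
```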
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/codontable.rb bioruby-a/lib/bio/data/codontable.rb

--- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/codontable.rb 2008-09-03 22:24:39.000000000 +0900
+++ bioruby-a/lib/bio/data/codontable.rb        2008-09-13 12:06:28.000000000 +0900
@@ -93,6 +93,23 @@
    def [](codon)
      @table[codon]
    end
+  def translate_ambiguity(codon, unknown = 'X')
+    triplet = codon + "NNN"
+    aa = nil
+    Bio::NucleicAcid.ambiguity2individual(triplet[2..2]).each do|third|
+      Bio::NucleicAcid.ambiguity2individual(triplet[0..0]).each do|first|
+        Bio::NucleicAcid.ambiguity2individual(triplet[1..1]).each do|second|
+          if aa == nil
+            aa = @table[first+second+third]
+          elsif
+            aa != @table[first+second+third]
+            return unknown
+          end
+        end
+      end
+    end
+    aa
+  end

    # Modify the codon table.  Use with caution as it may break hard coded
    # tables.  If you want to modify existing table, you should use copy
diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/na.rb bioruby-a/lib/bio/data/na.rb
--- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/na.rb 2008-09-03 22:24:39.000000000 +0900
+++ bioruby-a/lib/bio/data/na.rb        2008-09-13 12:06:28.000000000 +0900
@@ -182,6 +182,13 @@
        end
        Regexp.new(str)
      end
+    def ambiguity2individual(na, rna = false)
+      str = NAMES[na.downcase].gsub(/[\[\]]/,"")
+      if rna
+        str.tr!("t", "u")
+      end
+      str.split(//)
+    end

    end

diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/sequence/na.rb bioruby-a/lib/bio/sequence/na.rb
--- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/sequence/na.rb     2008-09-03 22:24:39.000000000 +0900
+++ bioruby-a/lib/bio/sequence/na.rb    2008-09-15 18:57:19.000000000 +0900
@@ -231,7 +231,7 @@
    #   (default 1)
    # * (optional) _unknown_: Character (default 'X')
    # *Returns*:: Bio::Sequence::AA object
-  def translate(frame = 1, table = 1, unknown = 'X')
+  def translate(frame = 1, table = 1, unknown = 'X', check_start = false)
      if table.is_a?(Bio::CodonTable)
        ct = table
      else
@@ -251,8 +251,19 @@
        from = 0
      end
      nalen = naseq.length - from
-    nalen -= nalen % 3
-    aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or unknown}
+#    nalen -= nalen % 3
+    if check_start and from == 0 and ct.start_codon?(naseq[0, 3])
+      if nalen > 3
+        aaseq = "M" + naseq[from+3, nalen-3].gsub(/.{1,3}/) {|codon| ct[codon] or ct.translate_ambiguity(codon, unknown)}
+      else
+        aaseq = "M"
+      end
+    else
+      aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct[codon] or ct.translate_ambiguity(codon, unknown)}
+    end
+    if nalen % 3 != 0
+      aaseq.sub!(/#{Regexp.escape(unknown)}$/,"")
+    end
      return Bio::Sequence::AA.new(aaseq)
    end