[BioRuby] Translate ambiguous sequence

Mon Sep 15 08:12:52 EDT 2008

Hi,

* check_start

As you suggested, the codon table object (Bio::CodonTable) already holds a list of
start codons, but the Bio::Sequence::NA#translate method does not
use it (the same is true for the stop codons).

lib/bio/data/codontable.rb:
------------------------------------------------------------
  # Create your own codon table by giving a Hash table of codons and relevant
  # amino acids.  You can also able to define the table's name as a second
  # argument.
  #
  # Two Arrays 'start' and 'stop' can be specified which contains a list of
  # start and stop codons used by 'start_codon?' and 'stop_codon?' methods.
  def initialize(hash, definition = nil, start = [], stop = [])
    @table = hash
    @definition = definition
    @start = start
    @stop = stop.empty? ? generate_stop : stop
  end
------------------------------------------------------------

So your code below should be included in some way
(although I would prefer to set check_start = true by default and to
use an explicit 'first_codon' variable instead of naseq[0, 3]).

------------------------------------------------------------
+    if check_start and from == 0 and ct.start_codon?(naseq[0, 3])
------------------------------------------------------------
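
A minimal standalone sketch of this idea, assuming a plain Hash and Array
in place of Bio::CodonTable (translate_sketch, the tiny table and the start
list are hypothetical names and data used only for illustration, and frame
handling is left out):

```ruby
# Standalone sketch: a plain Hash / Array stand in for Bio::CodonTable.
# translate_sketch is a hypothetical helper, not part of BioRuby.
def translate_sketch(naseq, table, start_codons, check_start = true, unknown = 'X')
  usable = naseq.length - naseq.length % 3        # drop a trailing partial codon
  codons = naseq[0, usable].scan(/.{3}/)
  return '' if codons.empty?

  first_codon = codons.first                      # explicit variable instead of naseq[0, 3]
  aas = codons.map { |codon| table[codon] || unknown }
  # A start codon at the very beginning is read as M, whatever the table says.
  aas[0] = 'M' if check_start && start_codons.include?(first_codon)
  aas.join
end

table  = { 'gtg' => 'V', 'atg' => 'M', 'aaa' => 'K' }  # tiny illustrative table
starts = %w[atg gtg]                                   # assumed start codon list

puts translate_sketch('gtgaaagtg', table, starts)         # start codon check on
puts translate_sketch('gtgaaagtg', table, starts, false)  # start codon check off
```

Only the first codon is affected; an internal 'gtg' is still looked up in
the table as usual.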

* ambiguity

As for the ambiguity, your need seems to be limited
to the 3' end of the sequence, but there may also be demand
for translating 'n's inside the sequence.

Bio::Sequence::NA#translate accepts your own codon table object
as the 2nd argument, and you can copy and override
the default codon tables (#1 to #23) or define your own
codon table from scratch. So another approach is to define
the ambiguous translations yourself.

------------------------------------------------------------
your_codon_table = Bio::CodonTable.copy(1)
your_codon_table['cgn'] = 'R'
your_codon_table['cg'] = 'R'

aaseq = naseq.translate(frame, your_codon_table)
------------------------------------------------------------
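
Entries like 'cgn' and 'cg' could also be derived mechanically rather than
typed by hand. The following is only a sketch under assumptions: it builds
the standard code (table 1) as a plain Hash instead of using Bio::CodonTable,
and standard_table / add_fourfold_entries are hypothetical helper names.

```ruby
# Build the standard genetic code (table 1) as a plain Hash, lowercase DNA.
def standard_table
  bases = %w[t c a g]
  aas = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
  table = {}
  bases.each_with_index do |b1, i|
    bases.each_with_index do |b2, j|
      bases.each_with_index do |b3, k|
        table[b1 + b2 + b3] = aas[16 * i + 4 * j + k]
      end
    end
  end
  table
end

# For every two-base prefix whose four codons give the same amino acid,
# add an 'xyn' entry and an 'xy' entry (e.g. 'cgn' => 'R', 'cg' => 'R').
def add_fourfold_entries(table)
  bases = %w[t c a g]
  extended = table.dup
  bases.product(bases).each do |b1, b2|
    prefix = b1 + b2
    aas = bases.map { |b3| table[prefix + b3] }.uniq
    next unless aas.size == 1          # skip prefixes that are still ambiguous
    extended[prefix + 'n'] = aas.first
    extended[prefix] = aas.first
  end
  extended
end

ext = add_fourfold_entries(standard_table)
puts ext['cgn']          # amino acid for the fourfold degenerate 'cg' prefix
puts ext['at'].inspect   # nil, because att/atc/ata and atg disagree
```

With a real Bio::CodonTable copy, the same loop would assign through
your_codon_table[...] = ... as in the example above.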

To do this, we only need to change the following lines

lib/bio/sequence/na.rb (translate):
------------------------------------------------------------
   nalen -= nalen % 3
   aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or unknown}
------------------------------------------------------------

to the following

------------------------------------------------------------
   #nalen -= nalen % 3
   aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct[codon] or unknown}
------------------------------------------------------------

though perhaps with a toggle flag to enable or disable this feature.
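
Such a toggle might look roughly like the sketch below. It is standalone
Ruby with a plain Hash as the table; translate_codons and its 'partial'
argument are hypothetical names, not an existing BioRuby interface.

```ruby
# Sketch of the proposed change with a toggle flag (hypothetical 'partial').
# partial = false : current behaviour, a trailing 1-2 base fragment is dropped.
# partial = true  : the trailing fragment is also looked up in the table.
def translate_codons(naseq, table, unknown = 'X', partial = false)
  if partial
    naseq.gsub(/.{1,3}/) { |codon| table[codon] || unknown }
  else
    usable = naseq.length - naseq.length % 3
    naseq[0, usable].gsub(/.{3}/) { |codon| table[codon] || unknown }
  end
end

table = { 'atg' => 'M', 'cgt' => 'R', 'cgn' => 'R', 'cg' => 'R' }

puts translate_codons('atgcgtcg', table)             # fragment 'cg' ignored
puts translate_codons('atgcgtcg', table, 'X', true)  # fragment 'cg' translated
```

Note that with the flag on, a trailing fragment that is not in the table
becomes the unknown character, so the caller may want to strip it.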

Regards,
Toshiaki Katayama

On 2008/09/15, at 19:08, Tomoaki NISHIYAMA wrote:

> Hi,
>
> To make translation more compatible with how databases derive a protein
> entry from a DNA entry, I think special treatment of the start codon and
> of incomplete codons is necessary.
>
> Special treatment of start codons is needed for codons that are
> translated to M only when used as the start codon and
> to a different amino acid when used as an internal codon within a CDS.
> For example, GUG is V if it is internal to the CDS, but it can also serve
> as a start codon, and in that case it encodes M.
> I think an option is required to change this behavior.
>
> Incomplete codons are seen at the end of incomplete CDS, presumably due to
> cloning or sequencing strategy.
> When 'cg' is at the end of a CDS, it is translated to 'R',
> because any third nucleotide would make the codon translate to 'R'.
>
> It seems the translation is added only if the amino acid can be determined and is not 'X'.
> -- 
> Tomoaki NISHIYAMA
>
> Advanced Science Research Center,
> Kanazawa University,
> 13-1 Takara-machi,
> Kanazawa, 920-0934, Japan
>
> diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/codontable.rb bioruby-a/lib/bio/data/codontable.rb
> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/codontable.rb 2008-09-03 22:24:39.000000000 +0900
> +++ bioruby-a/lib/bio/data/codontable.rb        2008-09-13 12:06:28.000000000 +0900
> @@ -93,6 +93,23 @@
>   def [](codon)
>     @table[codon]
>   end
> +  def translate_ambiguity(codon, unknown = 'X')
> +    triplet = codon + "NNN"
> +    aa = nil
> +    Bio::NucleicAcid.ambiguity2individual(triplet[2..2]).each do|third|
> +      Bio::NucleicAcid.ambiguity2individual(triplet[0..0]).each do|first|
> +        Bio::NucleicAcid.ambiguity2individual(triplet[1..1]).each do|second|
> +          if aa == nil
> +            aa = @table[first+second+third]
> +          elsif
> +            aa != @table[first+second+third]
> +            return unknown
> +          end
> +        end
> +      end
> +    end
> +    aa
> +  end
>
>   # Modify the codon table.  Use with caution as it may break hard coded
>   # tables.  If you want to modify existing table, you should use copy
> diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/na.rb bioruby-a/lib/bio/data/na.rb
> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/na.rb 2008-09-03 22:24:39.000000000 +0900
> +++ bioruby-a/lib/bio/data/na.rb        2008-09-13 12:06:28.000000000 +0900
> @@ -182,6 +182,13 @@
>       end
>       Regexp.new(str)
>     end
> +    def ambiguity2individual(na, rna = false)
> +      str = NAMES[na.downcase].gsub(/[\[\]]/,"")
> +      if rna
> +        str.tr!("t", "u")
> +      end
> +      str.split(//)
> +    end
>
>   end
>
> diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/sequence/na.rb bioruby-a/lib/bio/sequence/na.rb
> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/sequence/na.rb     2008-09-03 22:24:39.000000000 +0900
> +++ bioruby-a/lib/bio/sequence/na.rb    2008-09-15 18:57:19.000000000 +0900
> @@ -231,7 +231,7 @@
>   #   (default 1)
>   # * (optional) _unknown_: Character (default 'X')
>   # *Returns*:: Bio::Sequence::AA object
> -  def translate(frame = 1, table = 1, unknown = 'X')
> +  def translate(frame = 1, table = 1, unknown = 'X', check_start = false)
>     if table.is_a?(Bio::CodonTable)
>       ct = table
>     else
> @@ -251,8 +251,19 @@
>       from = 0
>     end
>     nalen = naseq.length - from
> -    nalen -= nalen % 3
> -    aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or unknown}
> +#    nalen -= nalen % 3
> +    if check_start and from == 0 and ct.start_codon?(naseq[0, 3])
> +      if nalen > 3
> +        aaseq = "M" + naseq[from+3, nalen-3].gsub(/.{1,3}/) {|codon| ct[codon] or ct.translate_ambiguity(codon, unknown)}
> +      else
> +        aaseq = "M"
> +      end
> +    else
> +      aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct[codon] or ct.translate_ambiguity(codon, unknown)}
> +    end
> +    if nalen % 3 != 0
> +      aaseq.sub!(/X$/,"")
> +    end
>     return Bio::Sequence::AA.new(aaseq)
>   end
>
>