[BioRuby] Translate ambiguous sequence

Tue Sep 16 04:56:14 UTC 2008

Hi,

> It was set to false for the default for just not to
> change the default behavior and is ok to make true for me.

I just thought that if the main application of the 'translate'
method is to translate gene to protein sequence, current
implementation is incomplete and should be changed.
If not, retain the current behavior may be better.

> If the change of the interface is allowed,
> I prefer that the unknown be later option, since
> changing the unknown from 'X' is expected to be very rare,
> and, in fact, it can be done just a gsub operation without
> the help of the library.

I can agree (don't know how others think, though).
Another option is to provide different methods (interfaces)
for considering start/stop codons and ambiguous bases.
Or introduce named options...

> My need is not restricted to the 3' end, and also not restricted to
> 'N's but there are ten other IUPAC redundant codes.

Sorry, I misunderstood your code.

You are trying to translate all possible combinations of the ambiguous
bases on the fly.

Your code is fine and followings are just for discussion:

Is there no efficient way to statically generate a reduction of
the given codon table considering ambiguous bases...?

Your implementation seems to return 'unknown' if the translation of
the codon containing ambiguous bases are translated to the different
amino acid, however, the comparison occurs every time when the codon
is passed to the 'translate_ambiguity' method.

It would be helpful to know how many patterns needed to be generated
to match codons with ambiguous bases for 20 amino acids.

Is it possible to rewrite current Bio::CodonTable implementation
to utilize Regexp as a key for the codon table hash for this purpose?

Regards,
Toshiaki Katayama

On 2008/09/16, at 12:15, Tomoaki NISHIYAMA wrote:

> Hi,
>
> Thank you for comments.
> > (but I prefer to set check_start = true by default;
> It was set to false for the default for just not to
> change the default behavior and is ok to make true for me.
> If the change of the interface is allowed,
> I prefer that the unknown be later option, since
> changing the unknown from 'X' is expected to be very rare,
> and, in fact, it can be done just a gsub operation without
> the help of the library.
>
>> As for the ambiguity, your needs seems to be restricted
>> only for the 3' end of the sequence, but there may be demands
>> for translating 'n's in the sequence.
>
>
> My need is not restricted to the 3' end, and also not restricted to
> 'N's but there are ten other IUPAC redundant codes.
> The message on September 11 treated only on these situations
> (where whole triplet is given but contain an ambiguity code)
> but not conscious on the start and the 3' end translation of 2 base.
>
> I agree that addition of all possible redundant determinate codes to the codon tables
> is another way to resolve the ambiguity codes.
> But the table will be quite large to support all the possible
> combinations for all the tables (at least for human review),
> and a generator should be written.
> Expecting that sequences containing ambiguity is rare, I wrote the code that will
> not impact the efficiency of translating sequence without ambiguity.
> Apparently the code for ambiguity is quite expensive, but I do not expect translating
> sequences containing so many ambiguity code that is problematic.
> (High proportion of ambiguity in itself is ok if the sequence is not very long).
> -- 
> Tomoaki NISHIYAMA
>
> Advanced Science Research Center,
> Kanazawa University,
> 13-1 Takara-machi,
> Kanazawa, 920-0934, Japan
>
>
> On 2008/09/15, at 21:12, Toshiaki Katayama wrote:
>
>> Hi,
>>
>> * check_start
>>
>> As you suggested, the codon table object (Bio::CodonTable) holds a list of
>> start codons as a knowledge, but Bio::Sequence::NA#translate method does not
>> utilize it (it is also true for the stop codons).
>>
>> lib/bio/data/codontable.rb:
>> ------------------------------------------------------------
>>  # Create your own codon table by giving a Hash table of codons and relevant
>>  # amino acids.  You can also able to define the table's name as a second
>>  # argument.
>>  #
>>  # Two Arrays 'start' and 'stop' can be specified which contains a list of
>>  # start and stop codons used by 'start_codon?' and 'stop_codon?' methods.
>>  def initialize(hash, definition = nil, start = [], stop = [])
>>    @table = hash
>>    @definition = definition
>>    @start = start
>>    @stop = stop.empty? ? generate_stop : stop
>>  end
>> ------------------------------------------------------------
>>
>> So, the following your code should be included in someway
>> (but I prefer to set check_start = true by default; and
>> use 'first_codon' variable explicitly instead of naseq[0, 3]).
>>
>> ------------------------------------------------------------
>> +    if check_start and from == 0 and ct.start_codon?(naseq[0, 3])
>> ------------------------------------------------------------
>>
>>
>> * ambiguity
>>
>> As for the ambiguity, your needs seems to be restricted
>> only for the 3' end of the sequence, but there may be demands
>> for translating 'n's in the sequence.
>>
>> As the Bio::Sequence::NA#translate accepts the codon table object
>> of your own as the 2nd argument, and you can copy and override
>> the default codon tables (#1 to #23; or you can define your own
>> codon table from scratch), there may be another approach to define
>> ambiguous translations by your own.
>>
>> ------------------------------------------------------------
>> your_codon_table = Bio::CodonTable.copy(1)
>> your_codon_table['cgn'] = 'R'
>> your_codon_table['cg'] = 'R'
>>
>> aaseq = naseq.translate(frame, your_codon_table)
>> ------------------------------------------------------------
>>
>> To do this, we only need to change the following lines
>>
>> lib/bio/sequence/na.rb (translate):
>> ------------------------------------------------------------
>>   nalen -= nalen % 3
>>   aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or unknown}
>> ------------------------------------------------------------
>>
>> to the below
>>
>> ------------------------------------------------------------
>>   #nalen -= nalen % 3
>>   aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct[codon] or unknown}
>> ------------------------------------------------------------
>>
>> but may be with a toggle flag to enable/disable this feature.
>>
>> Regards,
>> Toshiaki Katayama
>>
>>
>>
>> On 2008/09/15, at 19:08, Tomoaki NISHIYAMA wrote:
>>
>>> Hi,
>>>
>>> To further make translation compatible what is done between DNA entry and protein
>>> entry in databases, I thought that special treatment of the start codon and
>>> incomplete codons are necessary.
>>>
>>> Special treatment of the start codons are for those codons that is
>>> translated to M only when it is used as the start codon and
>>> a different amino acids if it is used as an internal codon within a CDS.
>>> For example GUG is V if it is internal to the CDS, but it can also serve
>>> as a start codon and in that case it encodes M.
>>> To change the behavior, I think an option is required.
>>>
>>> Incomplete codons are seen at the end of incomplete CDS, presumably due to
>>> cloning or sequencing strategy.
>>> When there are 'cg' at the end of CDS that are translated to 'R'
>>> as any nucleotide would make the codon translate as 'R'
>>>
>>> It seems the translation are added only if the amino acid can be specified and is not 'X'.
>>> -- 
>>> Tomoaki NISHIYAMA
>>>
>>> Advanced Science Research Center,
>>> Kanazawa University,
>>> 13-1 Takara-machi,
>>> Kanazawa, 920-0934, Japan
>>>
>>> diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/codontable.rb bioruby-a/lib/bio/data/codontable.rb
>>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/codontable.rb 2008-09-03 22:24:39.000000000 +0900
>>> +++ bioruby-a/lib/bio/data/codontable.rb        2008-09-13 12:06:28.000000000 +0900
>>> @@ -93,6 +93,23 @@
>>>  def [](codon)
>>>    @table[codon]
>>>  end
>>> +  def translate_ambiguity(codon, unknown = 'X')
>>> +    triplet = codon + "NNN"
>>> +    aa = nil
>>> +    Bio::NucleicAcid.ambiguity2individual(triplet[2..2]).each do|third|
>>> +      Bio::NucleicAcid.ambiguity2individual(triplet[0..0]).each do|first|
>>> +        Bio::NucleicAcid.ambiguity2individual(triplet[1..1]).each do|second|
>>> +          if aa == nil
>>> +            aa = @table[first+second+third]
>>> +          elsif
>>> +            aa != @table[first+second+third]
>>> +            return unknown
>>> +          end
>>> +        end
>>> +      end
>>> +    end
>>> +    aa
>>> +  end
>>>
>>>  # Modify the codon table.  Use with caution as it may break hard coded
>>>  # tables.  If you want to modify existing table, you should use copy
>>> diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/na.rb bioruby-a/lib/bio/data/na.rb
>>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/na.rb 2008-09-03 22:24:39.000000000 +0900
>>> +++ bioruby-a/lib/bio/data/na.rb        2008-09-13 12:06:28.000000000 +0900
>>> @@ -182,6 +182,13 @@
>>>      end
>>>      Regexp.new(str)
>>>    end
>>> +    def ambiguity2individual(na, rna = false)
>>> +      str = NAMES[na.downcase].gsub(/[\[\]]/,"")
>>> +      if rna
>>> +        str.tr!("t", "u")
>>> +      end
>>> +      str.split(//)
>>> +    end
>>>
>>>  end
>>>
>>> diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/sequence/na.rb bioruby-a/lib/bio/sequence/na.rb
>>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/sequence/na.rb     2008-09-03 22:24:39.000000000 +0900
>>> +++ bioruby-a/lib/bio/sequence/na.rb    2008-09-15 18:57:19.000000000 +0900
>>> @@ -231,7 +231,7 @@
>>>  #   (default 1)
>>>  # * (optional) _unknown_: Character (default 'X')
>>>  # *Returns*:: Bio::Sequence::AA object
>>> -  def translate(frame = 1, table = 1, unknown = 'X')
>>> +  def translate(frame = 1, table = 1, unknown = 'X', check_start = false)
>>>    if table.is_a?(Bio::CodonTable)
>>>      ct = table
>>>    else
>>> @@ -251,8 +251,19 @@
>>>      from = 0
>>>    end
>>>    nalen = naseq.length - from
>>> -    nalen -= nalen % 3
>>> -    aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or unknown}
>>> +#    nalen -= nalen % 3
>>> +    if check_start and from == 0 and ct.start_codon?(naseq[0, 3])
>>> +      if nalen > 3
>>> +        aaseq = "M" + naseq[from+3, nalen-3].gsub(/.{1,3}/) {|codon| ct[codon] or ct.translate_ambiguity(codon, unknown)}
>>> +      else
>>> +        aaseq = "M"
>>> +      end
>>> +    else
>>> +      aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct[codon] or ct.translate_ambiguity(codon, unknown)}
>>> +    end
>>> +    if nalen % 3 != 0
>>> +      aaseq.sub!(/X$/,"")
>>> +    end
>>>    return Bio::Sequence::AA.new(aaseq)
>>>  end
>>>
>>>
>>> _______________________________________________
>>> BioRuby mailing list
>>> BioRuby at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioruby
>>
>>
>> _______________________________________________
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby
>>
>