[BioRuby] Translate ambiguous sequence
Tomoaki NISHIYAMA
tomoakin at kenroku.kanazawa-u.ac.jp
Tue Sep 16 02:38:37 EDT 2008
Hi,
> Is there no efficient way to statically generate a reduction of
> the given codon table considering ambiguous bases...?
>
> Your implementation seems to return 'unknown' if the translation of
> the codon containing ambiguous bases are translated to the different
> amino acid, however, the comparison occurs every time when the codon
> is passed to the 'translate_ambiguity' method.
>
> It would be helpful to know how many patterns needed to be generated
> to match codons with ambiguous bases for 20 amino acids.
Generation of the hash in itself is not very difficult, (just iterate
over all the
possible triplet and dinucleotides, with some assumption on the table)
and 174-195 keys are sufficient for each of preexisting codon tables.
(for 20 amino acids plus '*')
The benefit is usually quite low as there are
little ambiguity in the DNA sequence
(because low quality regions are deleted at an earlier process).
The hash might worth included for standard codontables when someone
are to directly process a large quantity of poor quality sequence
data. (Maybe 454 or Solexa?)
For codontable object that are copied and modified,
I expect there are little cases when the cost to generate that
table for ambiguity treatment is smaller than the on the fly comparison.
#!/usr/local/bin/ruby
require 'bio'
dnanucleotides = ['a', 'c', 'g', 't', 'y', 'r', 'w', 's', 'k', 'm',
'b', 'd', 'h', 'v', 'n']
tableary=Array.new
[1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14, 15, 16, 21, 22, 23].each do
|tableno|
partialhash = Hash.new
dnanucleotides.each do |first|
dnanucleotides.each do |second|
dnaseq = Bio::Sequence::NA.new(first + second)
transl = dnaseq.translate(1,tableno)
if transl != 'X' and transl != ""
partialhash[dnaseq] = transl
end
dnanucleotides.each do |third|
dnaseq = Bio::Sequence::NA.new(first + second + third)
transl = dnaseq.translate(1,tableno)
if transl != 'X'
partialhash[dnaseq] = transl
end
end
end
end
puts "table#{tableno}: #{partialhash.size} patterns"
# p partialhash
end
--
Tomoaki NISHIYAMA
Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan
On 2008/09/16, at 13:56, Toshiaki Katayama wrote:
> Hi,
>
>> It was set to false for the default for just not to
>> change the default behavior and is ok to make true for me.
>
> I just thought that if the main application of the 'translate'
> method is to translate gene to protein sequence, current
> implementation is incomplete and should be changed.
> If not, retain the current behavior may be better.
>
>> If the change of the interface is allowed,
>> I prefer that the unknown be later option, since
>> changing the unknown from 'X' is expected to be very rare,
>> and, in fact, it can be done just a gsub operation without
>> the help of the library.
>
> I can agree (don't know how others think, though).
> Another option is to provide different methods (interfaces)
> for considering start/stop codons and ambiguous bases.
> Or introduce named options...
>
>> My need is not restricted to the 3' end, and also not restricted to
>> 'N's but there are ten other IUPAC redundant codes.
>
> Sorry, I misunderstood your code.
>
> You are trying to translate all possible combinations of the ambiguous
> bases on the fly.
>
> Your code is fine and followings are just for discussion:
>
> Is there no efficient way to statically generate a reduction of
> the given codon table considering ambiguous bases...?
>
> Your implementation seems to return 'unknown' if the translation of
> the codon containing ambiguous bases are translated to the different
> amino acid, however, the comparison occurs every time when the codon
> is passed to the 'translate_ambiguity' method.
>
> It would be helpful to know how many patterns needed to be generated
> to match codons with ambiguous bases for 20 amino acids.
>
> Is it possible to rewrite current Bio::CodonTable implementation
> to utilize Regexp as a key for the codon table hash for this purpose?
>
> Regards,
> Toshiaki Katayama
>
>
> On 2008/09/16, at 12:15, Tomoaki NISHIYAMA wrote:
>
>> Hi,
>>
>> Thank you for comments.
>>> (but I prefer to set check_start = true by default;
>> It was set to false for the default for just not to
>> change the default behavior and is ok to make true for me.
>> If the change of the interface is allowed,
>> I prefer that the unknown be later option, since
>> changing the unknown from 'X' is expected to be very rare,
>> and, in fact, it can be done just a gsub operation without
>> the help of the library.
>>
>>> As for the ambiguity, your needs seems to be restricted
>>> only for the 3' end of the sequence, but there may be demands
>>> for translating 'n's in the sequence.
>>
>>
>> My need is not restricted to the 3' end, and also not restricted to
>> 'N's but there are ten other IUPAC redundant codes.
>> The message on September 11 treated only on these situations
>> (where whole triplet is given but contain an ambiguity code)
>> but not conscious on the start and the 3' end translation of 2 base.
>>
>> I agree that addition of all possible redundant determinate codes
>> to the codon tables
>> is another way to resolve the ambiguity codes.
>> But the table will be quite large to support all the possible
>> combinations for all the tables (at least for human review),
>> and a generator should be written.
>> Expecting that sequences containing ambiguity is rare, I wrote the
>> code that will
>> not impact the efficiency of translating sequence without ambiguity.
>> Apparently the code for ambiguity is quite expensive, but I do not
>> expect translating
>> sequences containing so many ambiguity code that is problematic.
>> (High proportion of ambiguity in itself is ok if the sequence is
>> not very long).
>> --
>> Tomoaki NISHIYAMA
>>
>> Advanced Science Research Center,
>> Kanazawa University,
>> 13-1 Takara-machi,
>> Kanazawa, 920-0934, Japan
>>
>>
>> On 2008/09/15, at 21:12, Toshiaki Katayama wrote:
>>
>>> Hi,
>>>
>>> * check_start
>>>
>>> As you suggested, the codon table object (Bio::CodonTable) holds
>>> a list of
>>> start codons as a knowledge, but Bio::Sequence::NA#translate
>>> method does not
>>> utilize it (it is also true for the stop codons).
>>>
>>> lib/bio/data/codontable.rb:
>>> ------------------------------------------------------------
>>> # Create your own codon table by giving a Hash table of codons
>>> and relevant
>>> # amino acids. You can also able to define the table's name as
>>> a second
>>> # argument.
>>> #
>>> # Two Arrays 'start' and 'stop' can be specified which contains
>>> a list of
>>> # start and stop codons used by 'start_codon?' and 'stop_codon?'
>>> methods.
>>> def initialize(hash, definition = nil, start = [], stop = [])
>>> @table = hash
>>> @definition = definition
>>> @start = start
>>> @stop = stop.empty? ? generate_stop : stop
>>> end
>>> ------------------------------------------------------------
>>>
>>> So, the following your code should be included in someway
>>> (but I prefer to set check_start = true by default; and
>>> use 'first_codon' variable explicitly instead of naseq[0, 3]).
>>>
>>> ------------------------------------------------------------
>>> + if check_start and from == 0 and ct.start_codon?(naseq[0, 3])
>>> ------------------------------------------------------------
>>>
>>>
>>> * ambiguity
>>>
>>> As for the ambiguity, your needs seems to be restricted
>>> only for the 3' end of the sequence, but there may be demands
>>> for translating 'n's in the sequence.
>>>
>>> As the Bio::Sequence::NA#translate accepts the codon table object
>>> of your own as the 2nd argument, and you can copy and override
>>> the default codon tables (#1 to #23; or you can define your own
>>> codon table from scratch), there may be another approach to define
>>> ambiguous translations by your own.
>>>
>>> ------------------------------------------------------------
>>> your_codon_table = Bio::CodonTable.copy(1)
>>> your_codon_table['cgn'] = 'R'
>>> your_codon_table['cg'] = 'R'
>>>
>>> aaseq = naseq.translate(frame, your_codon_table)
>>> ------------------------------------------------------------
>>>
>>> To do this, we only need to change the following lines
>>>
>>> lib/bio/sequence/na.rb (translate):
>>> ------------------------------------------------------------
>>> nalen -= nalen % 3
>>> aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or
>>> unknown}
>>> ------------------------------------------------------------
>>>
>>> to the below
>>>
>>> ------------------------------------------------------------
>>> #nalen -= nalen % 3
>>> aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct[codon] or
>>> unknown}
>>> ------------------------------------------------------------
>>>
>>> but may be with a toggle flag to enable/disable this feature.
>>>
>>> Regards,
>>> Toshiaki Katayama
>>>
>>>
>>>
>>> On 2008/09/15, at 19:08, Tomoaki NISHIYAMA wrote:
>>>
>>>> Hi,
>>>>
>>>> To further make translation compatible what is done between DNA
>>>> entry and protein
>>>> entry in databases, I thought that special treatment of the
>>>> start codon and
>>>> incomplete codons are necessary.
>>>>
>>>> Special treatment of the start codons are for those codons that is
>>>> translated to M only when it is used as the start codon and
>>>> a different amino acids if it is used as an internal codon
>>>> within a CDS.
>>>> For example GUG is V if it is internal to the CDS, but it can
>>>> also serve
>>>> as a start codon and in that case it encodes M.
>>>> To change the behavior, I think an option is required.
>>>>
>>>> Incomplete codons are seen at the end of incomplete CDS,
>>>> presumably due to
>>>> cloning or sequencing strategy.
>>>> When there are 'cg' at the end of CDS that are translated to 'R'
>>>> as any nucleotide would make the codon translate as 'R'
>>>>
>>>> It seems the translation are added only if the amino acid can be
>>>> specified and is not 'X'.
>>>> --
>>>> Tomoaki NISHIYAMA
>>>>
>>>> Advanced Science Research Center,
>>>> Kanazawa University,
>>>> 13-1 Takara-machi,
>>>> Kanazawa, 920-0934, Japan
>>>>
>>>> diff -ru bioruby-
>>>> bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/
>>>> codontable.rb bioruby-a/lib/bio/data/codontable.rb
>>>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/
>>>> bio/data/codontable.rb 2008-09-03 22:24:39.000000000 +0900
>>>> +++ bioruby-a/lib/bio/data/codontable.rb 2008-09-13
>>>> 12:06:28.000000000 +0900
>>>> @@ -93,6 +93,23 @@
>>>> def [](codon)
>>>> @table[codon]
>>>> end
>>>> + def translate_ambiguity(codon, unknown = 'X')
>>>> + triplet = codon + "NNN"
>>>> + aa = nil
>>>> + Bio::NucleicAcid.ambiguity2individual(triplet[2..2]).each
>>>> do|third|
>>>> + Bio::NucleicAcid.ambiguity2individual(triplet[0..0]).each
>>>> do|first|
>>>> + Bio::NucleicAcid.ambiguity2individual(triplet
>>>> [1..1]).each do|second|
>>>> + if aa == nil
>>>> + aa = @table[first+second+third]
>>>> + elsif
>>>> + aa != @table[first+second+third]
>>>> + return unknown
>>>> + end
>>>> + end
>>>> + end
>>>> + end
>>>> + aa
>>>> + end
>>>>
>>>> # Modify the codon table. Use with caution as it may break
>>>> hard coded
>>>> # tables. If you want to modify existing table, you should use
>>>> copy
>>>> diff -ru bioruby-
>>>> bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/
>>>> na.rb bioruby-a/lib/bio/data/na.rb
>>>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/
>>>> bio/data/na.rb 2008-09-03 22:24:39.000000000 +0900
>>>> +++ bioruby-a/lib/bio/data/na.rb 2008-09-13
>>>> 12:06:28.000000000 +0900
>>>> @@ -182,6 +182,13 @@
>>>> end
>>>> Regexp.new(str)
>>>> end
>>>> + def ambiguity2individual(na, rna = false)
>>>> + str = NAMES[na.downcase].gsub(/[\[\]]/,"")
>>>> + if rna
>>>> + str.tr!("t", "u")
>>>> + end
>>>> + str.split(//)
>>>> + end
>>>>
>>>> end
>>>>
>>>> diff -ru bioruby-
>>>> bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/
>>>> sequence/na.rb bioruby-a/lib/bio/sequence/na.rb
>>>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/
>>>> bio/sequence/na.rb 2008-09-03 22:24:39.000000000 +0900
>>>> +++ bioruby-a/lib/bio/sequence/na.rb 2008-09-15
>>>> 18:57:19.000000000 +0900
>>>> @@ -231,7 +231,7 @@
>>>> # (default 1)
>>>> # * (optional) _unknown_: Character (default 'X')
>>>> # *Returns*:: Bio::Sequence::AA object
>>>> - def translate(frame = 1, table = 1, unknown = 'X')
>>>> + def translate(frame = 1, table = 1, unknown = 'X',
>>>> check_start = false)
>>>> if table.is_a?(Bio::CodonTable)
>>>> ct = table
>>>> else
>>>> @@ -251,8 +251,19 @@
>>>> from = 0
>>>> end
>>>> nalen = naseq.length - from
>>>> - nalen -= nalen % 3
>>>> - aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon]
>>>> or unknown}
>>>> +# nalen -= nalen % 3
>>>> + if check_start and from == 0 and ct.start_codon?(naseq[0, 3])
>>>> + if nalen > 3
>>>> + aaseq = "M" + naseq[from+3, nalen-3].gsub(/.{1,3}/) {|
>>>> codon| ct[codon] or ct.translate_ambiguity(codon, unknown)}
>>>> + else
>>>> + aaseq = "M"
>>>> + end
>>>> + else
>>>> + aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct
>>>> [codon] or ct.translate_ambiguity(codon, unknown)}
>>>> + end
>>>> + if nalen % 3 != 0
>>>> + aaseq.sub!(/X$/,"")
>>>> + end
>>>> return Bio::Sequence::AA.new(aaseq)
>>>> end
>>>>
>>>>
>>>> _______________________________________________
>>>> BioRuby mailing list
>>>> BioRuby at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioruby
>>>
>>>
>>> _______________________________________________
>>> BioRuby mailing list
>>> BioRuby at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioruby
>>>
>>
>
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>
More information about the BioRuby
mailing list