[BioRuby] restriction enzyme module

Thu Apr 5 14:55:48 UTC 2007

Trevor,

On 2007/04/05, at 6:56, Trevor Wennblom wrote:
> On Apr 4, 2007, at 4:48 PM, Trevor Wennblom wrote:
>> == 9 ==
>>
>> Regarding to_re
>>
>> Did you ever get to_re to work correctly?  I seem to remember you had  
>> a pretty good patch a while back.  I'm still really itching for an  
>> official fix since there are many cases where Bio::RestrictionEnzyme  
>> returns the wrong cut fragments because of incorrect matching.
>>
>> Presently
>>
>>    puts Bio::Sequence::NA.new("tan").to_re  # => (?-mix:ta[atgc])
>>
>> Should be
>>
>>    puts Bio::Sequence::NA.new("tan").to_re  # => (?-mix:ta[atgcn])
>
> Oh right - actually it should be all inclusive, so "n" should match  
> "yrwskmbdhvnatgcu".
>
> In this case for the example giving:
>
>    puts Bio::Sequence::NA.new("tan").to_re  # => (?-mix:ta[yrwskmbdhvnatgcu])

I'll forward the patch I sent to you.
Do you think this is applicable?

Toshiaki

Begin forwarded message:
> From: Toshiaki Katayama <ktym at hgc.jp>
> Date: 2007年1月1日 15:10:11:JST
> To: Trevor Wennblom <trevor at corevx.com>
> Cc: staff at bioruby.org
> Subject: Re: to_re revisited
>
> Hi Trevor,
>
> Thank you for your CVS commits.
>
> On 2007/01/01, at 9:23, Trevor Wennblom wrote:
>> I noticed that 'to_re' doesn't yet match itself.  For example:
>>
>> irb(main):001:0> require 'bio'
>> => true
>> irb(main):002:0> seq = Bio::Sequence::NA.new('arg')
>> => "arg"
>> irb(main):003:0> seq =~ seq.to_re
>> => nil
>> irb(main):004:0> seq.to_re
>> => /a[ag]g/
>> irb(main):005:0> seq =~ /a[rag]g/
>> => 0
>>
>> Will there be any chance that this could be fixed before the next release?
>>
>> If you're interested we exchanged emails about this issue in 2005 with the subject: "Bio::NucleicAcid na.rb to_re bug"
>
>
> I found attached my mail which couldn't be delivered to you.
>
>
> RCS file: /home/repository/bioruby/bioruby/lib/bio/data/na.rb,v
> retrieving revision 0.20
> diff -c -r0.20 na.rb
> *** na.rb       8 Feb 2006 12:15:42 -0000       0.20
> --- na.rb       1 Jan 2007 05:52:14 -0000
> ***************
> *** 158,172 ****
>       end
>
>       def to_re(seq, rna = false)
> !       str = seq.to_s.downcase
> !       str.gsub!(/[^atgcu]/) { |base|
> !         NAMES[base] || '.'
> !       }
> !       if rna
> !         str.tr!("t", "u")
> !       end
> !       Regexp.new(str)
> !     end
>
>     end
>
> --- 158,186 ----
>       end
>
>       def to_re(seq, rna = false)
> !      replace = {
> !        'y' => '[tcy]',
> !        'r' => '[agr]',
> !        'w' => '[atw]',
> !        's' => '[gcw]',
> !        'k' => '[tgk]',
> !        'm' => '[acm]',
> !        'b' => '[tgcyskb]',
> !        'd' => '[atgrwkd]',
> !        'h' => '[atcwmyh]',
> !        'v' => '[agcmrsv]',
> !        'n' => '[atgcyrwskmbdhvn]'
> !      }
> !
> !      str = seq.to_s.downcase
> !      str.gsub!(/[^atgcu]/) { |base|
> !        replace[base] || base
> !      }
> !      if rna
> !        str.tr!("t", "u")
> !      end
> !      Regexp.new(str)
> !    end
>
>     end
>
>
> By using this, following match works.
>
>
> bioruby> seq = Bio::Sequence::NA.new('arg')
>   ==> "arg"
> bioruby> seq =~ seq.to_re
>   ==> 0
> bioruby> seq.to_re
>   ==> /a[agr]g/
>
>
> But following still does not work as expected.
> How do you think?
>
>
> bioruby> seq = Bio::Sequence.new('arg')
>   ==> #<Bio::Sequence:0x14307b4 @seq="arg">
> bioruby> seq.na
>   ==> Bio::Sequence::NA
> bioruby> p seq
> #<Bio::Sequence:0x14307b4 @moltype=Bio::Sequence::NA, @seq="arg">
>   ==> nil
> bioruby> seq.to_re
>   ==> /a[agr]g/
> bioruby> seq =~ seq.to_re
>   ==> false
> bioruby> seq.seq =~ seq.to_re
>   ==> 0
> bioruby> p seq.to_re.match(seq)
> #<MatchData:0x1429d38>
>   ==> nil
> bioruby> p seq.to_re.match(seq.seq)
> #<MatchData:0x1427718>
>   ==> nil
>
>
> Toshiaki
>
>
>
> Begin forwarded message:
>> From: MAILER-DAEMON at bioruby.org (Mail Delivery System)
>> Date: 2005年12月11日 13:37:45:JST
>> To: k at bioruby.org
>> Subject: Undelivered Mail Returned to Sender
>>
>> This is the Postfix program at host kumamushi.bioruby.org.
>>
>> I'm sorry to have to inform you that your message could not be
>> be delivered to one or more recipients. It's attached below.
>>
>> For further assistance, please send mail to <postmaster>
>>
>> If you do so, please include this problem report. You can
>> delete your own text from the attached returned message.
>>
>> 			The Postfix program
>>
>> <trevor at corevx.com>: host mail.corevx.com[207.7.108.149] said: 554 Service
>>     unavailable; Client host [218.223.192.34] blocked using
>>     sbl-xbl.spamhaus.org; http://www.spamhaus.org/query/bl?ip=218.223.192.34
>>     (in reply to RCPT TO command)
>> Reporting-MTA: dns; kumamushi.bioruby.org
>> X-Postfix-Queue-ID: 77FD72120EC
>> X-Postfix-Sender: rfc822; k at bioruby.org
>> Arrival-Date: Sun, 11 Dec 2005 13:37:33 +0900 (JST)
>>
>> Final-Recipient: rfc822; trevor at corevx.com
>> Action: failed
>> Status: 5.0.0
>> Diagnostic-Code: X-Postfix; host mail.corevx.com[207.7.108.149] said: 554
>>     Service unavailable; Client host [218.223.192.34] blocked using
>>     sbl-xbl.spamhaus.org; http://www.spamhaus.org/query/bl?ip=218.223.192.34
>>     (in reply to RCPT TO command)
>>
>> From: Toshiaki Katayama <k at bioruby.org>
>> Date: 2005年12月11日 13:37:31:JST
>> To: staff at bioruby.org
>> Cc: Pjotr Prins <pjotr at pckassa.com>, Trevor Wennblom <trevor at corevx.com>
>> Subject: Fwd: Undelivered Mail Returned to Sender
>>
>>
>> Hi all,
>>
>> All my attempt to send my code to Trevor is failed.
>> Could anyone let him know the following URL?
>>
>>   http://kumamushi.org/~k/tmp/na_to_re.txt
>>
>> Toshiaki Katayama
>>
>> Begin forwarded message:
>>
>>> From: MAILER-DAEMON at bioruby.org (Mail Delivery System)
>>> Date: 2005年12月11日 13:29:39:JST
>>> To: ktym at hgc.jp
>>> Subject: Undelivered Mail Returned to Sender
>>>
>>> This is the Postfix program at host kumamushi.bioruby.org.
>>>
>>> I'm sorry to have to inform you that your message could not be
>>> be delivered to one or more recipients. It's attached below.
>>>
>>> For further assistance, please send mail to <postmaster>
>>>
>>> If you do so, please include this problem report. You can
>>> delete your own text from the attached returned message.
>>>
>>> 			The Postfix program
>>>
>>> <trevor at corevx.com>: host mail.corevx.com[207.7.108.149] said: 554 Service
>>>     unavailable; Client host [218.223.192.34] blocked using
>>>     sbl-xbl.spamhaus.org; http://www.spamhaus.org/query/bl?ip=218.223.192.34
>>>     (in reply to RCPT TO command)
>>> Reporting-MTA: dns; kumamushi.bioruby.org
>>> X-Postfix-Queue-ID: E9C0D2120BC
>>> X-Postfix-Sender: rfc822; ktym at hgc.jp
>>> Arrival-Date: Sun, 11 Dec 2005 13:29:37 +0900 (JST)
>>>
>>> Final-Recipient: rfc822; trevor at corevx.com
>>> Action: failed
>>> Status: 5.0.0
>>> Diagnostic-Code: X-Postfix; host mail.corevx.com[207.7.108.149] said: 554
>>>     Service unavailable; Client host [218.223.192.34] blocked using
>>>     sbl-xbl.spamhaus.org; http://www.spamhaus.org/query/bl?ip=218.223.192.34
>>>     (in reply to RCPT TO command)
>>>
>>> From: Toshiaki Katayama <ktym at hgc.jp>
>>> Date: 2005年12月11日 13:29:35:JST
>>> To: Trevor Wennblom <trevor at corevx.com>
>>> Subject: Re: Bio::NucleicAcid na.rb to_re bug
>>>
>>>
>>> Hello Trevor,
>>>
>>> Ughh, how could you receive my e-mail .....?
>>>
>>> I put the massage and the code at
>>>
>>>   http://kumamushi.org/~k/tmp/na_to_re.txt
>>>
>>> Toshiaki
>>>
>>> On 2005/12/11, at 13:14, Toshiaki Katayama wrote:
>>>
>>>> Trevor,
>>>>
>>>> I think to embed all possible combinations would be most efficient, after all.
>>>> Please see attached file, as your mail server reject my e-mail probably because
>>>> most of the contents was code...
>>>>
>>>> On 2005/12/11, at 3:34, Trevor Wennblom wrote:
>>>>
>>>>> Toshiaki Katayama wrote:
>>>>>
>>>>>> And the below is a ad-hoc implementation to include all possible combination of the ambiguous bases.
>>>>>> I removed brackets and sorted the bases contained in the ambiguous bases for easier implementation...
>>>>>>
>>>>>
>>>>>
>>>>> Great, I'll try to play with the ad-hoc code later this week.
>>>>>
>>>>> Thanks again for all your help Toshiaki!
>>>>> Trevor
>>>>
>>>> <na_to_re.txt.gz>
>>>
>>>
>>>
>>
>>
>>
>> On 2005/12/11, at 12:55, Toshiaki Katayama wrote:
>>> To embed all possible combinations would be most efficient, after all...
>>>
>>>     def to_re(seq, rna = false)
>>>       replace = {
>>>         'y' => '[tcy]',
>>>         'r' => '[agr]',
>>>         'w' => '[atw]',
>>>         's' => '[gcw]',
>>>         'k' => '[tgk]',
>>>         'm' => '[acm]',
>>>         'b' => '[tgcyskb]',
>>>         'd' => '[atgrwkd]',
>>>         'h' => '[atcwmyh]',
>>>         'v' => '[agcmrsv]',
>>>         'n' => '[atgcyrwskmbdhvn]'
>>>       }
>>>
>>>       str = seq.to_s
>>>       str.gsub!(/[^atgcu]/) { |base|
>>>         replace[base] || base
>>>       }
>>>       if rna
>>>         str.tr!("t", "u")
>>>       end
>>>       Regexp.new(str)
>>>     end
>>>
>>> On 2005/12/11, at 3:34, Trevor Wennblom wrote:
>>>
>>>> Toshiaki Katayama wrote:
>>>>
>>>>> And the below is a ad-hoc implementation to include all possible combination of the ambiguous bases.
>>>>> I removed brackets and sorted the bases contained in the ambiguous bases for easier implementation...
>>>>>
>>>>
>>>>
>>>> Great, I'll try to play with the ad-hoc code later this week.
>>>>
>>>> Thanks again for all your help Toshiaki!
>>>> Trevor
>>>
>>
>>
>>
>> On 2005/12/11, at 3:26, Toshiaki Katayama wrote:
>>> Trevor,
>>>
>>> I have commited the following code for the meanwhile.
>>>
>>>     def to_re(seq, rna = false)
>>>       str = seq.to_s
>>>       str.gsub!(/[^atgcu]/) { |base|
>>>         NAMES[base] || '.'
>>>       }
>>>       if rna
>>>         str.tr!("t", "u")
>>>       end
>>>       Regexp.new(str)
>>>     end
>>>
>>> And the below is a ad-hoc implementation to include all possible combination of the ambiguous bases.
>>> I removed brackets and sorted the bases contained in the ambiguous bases for easier implementation...
>>>
>>>     NAMES = {
>>>       'y'       => 'ct',
>>>       'r'       => 'ag',
>>>       'w'       => 'at',
>>>       's'       => 'cg',
>>>       'k'       => 'gt',
>>>       'm'       => 'ac',
>>>
>>>       'b'       => 'cgt',
>>>       'd'       => 'agt',
>>>       'h'       => 'act',
>>>       'v'       => 'acg',
>>>
>>>       'n'       => 'acgt',
>>>
>>>       'a'       => 'a',
>>>       't'       => 't',
>>>       'g'       => 'g',
>>>       'c'       => 'c',
>>>       'u'       => 'u',
>>>
>>>       'A'       => 'Adenine',
>>>       'T'       => 'Thymine',
>>>       'G'       => 'Guanine',
>>>       'C'       => 'Cytosine',
>>>       'U'       => 'Uracil',
>>>
>>>       'Y'       => 'pYrimidine',
>>>       'R'       => 'puRine',
>>>       'W'       => 'Weak',
>>>       'S'       => 'Strong',
>>>       'K'       => 'Keto',
>>>       'M'       => 'aroMatic',
>>>
>>>       'B'       => 'not A',
>>>       'D'       => 'not C',
>>>       'H'       => 'not G',
>>>       'V'       => 'not T',
>>>     }
>>>
>>>     def to_re(seq, rna = false)
>>>       str = seq.to_s
>>>       str.gsub!(/[^atgcu]/) { |base|
>>>         set = NAMES[base].dup || '.'
>>>         case set.size
>>>         when 1
>>>           # just pass the base
>>>         when 2
>>>           # add ambiguous base
>>>           set += base
>>> 	when 3
>>>           # add any combination of the ambiguous bases
>>>           set = combination_3c2(set)
>>>           set += base
>>> 	when 4
>>>           # use the knowledge (or implement recursive method)
>>>           set = "atgcyrwskmbdhv"
>>> 	end
>>>         set
>>>       }
>>>       if rna
>>>         str.tr!("t", "u")
>>>       end
>>>       Regexp.new(str)
>>>     end
>>>
>>>     private
>>>
>>>     def combination_3c2(str)
>>>       result = ''
>>>       rev = NAMES.invert
>>>       ary = str.split(//)
>>>       [ ary[0] + ary[1], ary[0] + ary[2], ary[1] + ary[2] ].each do |key|
>>>         result << rev[key]
>>>       end
>>>       result + str
>>>     end
>>>
>>> Toshiaki
>>>
>>> On 2005/12/11, at 2:37, Trevor Wennblom wrote:
>>>
>>>> Toshiaki Katayama wrote:
>>>>
>>>>> Trevor,
>>>>>
>>>>> In this sense, we need to include permutation of ambiguous bases recursively.
>>>>> e.g. b => [tgc] not just match /[btgc]/ but also /[bkystgc]/ right?
>>>>>
>>>>
>>>>
>>>> Yes, you're correct.  This is a bigger algorithm issue than I first realized.
>>>
>>
>>
>>
>>
>> On 2005/12/11, at 2:33, Toshiaki Katayama wrote:
>>> Trevor,
>>>
>>> In this sense, we need to include permutation of ambiguous bases recursively.
>>> e.g. b => [tgc] not just match /[btgc]/ but also /[bkystgc]/ right?
>>>
>>> Toshiaki
>>>
>>> On 2005/12/10, at 3:58, Trevor Wennblom wrote:
>>>
>>>> Trevor Wennblom wrote:
>>>>> For the purposes of the Regexp (untested), would you be interested in adding the following to NAMES:
>>>>>
>>>>>  'n' => '.',
>>>>>  'x' => '.?',
>>>>>
>>>>>  'N' => 'aNy',
>>>>>  'X' => 'any or deletion',
>>>>
>>>> Just to follow-up -
>>>> I see that NAMES presently has:
>>>> 'n' => '[atgc]',
>>>>
>>>> Which of course wont match 's', 'y', 'w', etc.  I'm trying to think of a case where 'n'=>'.' or 'x'=>'.?' may be problematic but I haven't been able to think of a problem with it yet.  Perhaps it would fail in situations where the user may have provided non-nucleotide characters .
>>>>
>>>> I suppose if we wanted to be more explicit we could do:
>>>> 'n' => '[atgcyrwskmbdhv]',
>>>> 'x' => '[atgcyrwskmbdhv]?',
>>>>
>>>> 'N' => 'aNy',
>>>> 'X' => 'any or deletion',
>>>>
>>>> I can come up with some tests for Regexp objects where 'x' may be involved to demonstrate it's utility.
>>>>
>>>> Thanks,
>>>> Trevor
>>>
>>
>>
>>
>>
>> On 2005/12/10, at 3:46, Trevor Wennblom wrote:
>>> Toshiaki Katayama wrote:
>>>>     def to_re(seq, rna = false)
>>>>       str = seq.to_s
>>>>       str.gsub!(/./) { |base|
>>>>         re = NAMES[base].dup || '.'
>>>>         if re[0] == ?[
>>>>           re[1,0] = base
>>>>         end
>>>>         re
>>>>       }
>>>>       if rna
>>>>         str.tr!("t", "u")
>>>>       end
>>>>       Regexp.new(str)
>>>>     end
>>>>
>>> ...
>>>> str.gsub!(/[^atgcu]/) { |base|
>>>>
>>>> would be much better!
>>>
>>> I haven't had a chance to test it, but just from looking at it seems very impressive Toshiaki!  I think it's good to go.
>>>
>>> For the purposes of the Regexp (untested), would you be interested in adding the following to NAMES:
>>>
>>>  'n' => '.',
>>>  'x' => '.?',
>>>
>>>  'N' => 'aNy',
>>>  'X' => 'any or deletion',
>>>
>>>
>>> Again, excellent work!
>>> Trevor
>>
>>
>>
>
>
>
>