[BioRuby] restriction enzyme module
Toshiaki Katayama
ktym at hgc.jp
Thu Apr 5 14:55:48 UTC 2007
Trevor,
On 2007/04/05, at 6:56, Trevor Wennblom wrote:
> On Apr 4, 2007, at 4:48 PM, Trevor Wennblom wrote:
>> == 9 ==
>>
>> Regarding to_re
>>
>> Did you ever get to_re to work correctly? I seem to remember you had
>> a pretty good patch a while back. I'm still really itching for an
>> official fix since there are many cases where Bio::RestrictionEnzyme
>> returns the wrong cut fragments because of incorrect matching.
>>
>> Presently
>>
>> puts Bio::Sequence::NA.new("tan").to_re # => (?-mix:ta[atgc])
>>
>> Should be
>>
>> puts Bio::Sequence::NA.new("tan").to_re # => (?-mix:ta[atgcn])
>
> Oh right - actually it should be all inclusive, so "n" should match
> "yrwskmbdhvnatgcu".
>
> In this case for the example giving:
>
> puts Bio::Sequence::NA.new("tan").to_re # => (?-mix:ta[yrwskmbdhvnatgcu])
I'll forward the patch I sent to you.
Do you think this is applicable?
Toshiaki
Begin forwarded message:
> From: Toshiaki Katayama <ktym at hgc.jp>
> Date: 2007年1月1日 15:10:11:JST
> To: Trevor Wennblom <trevor at corevx.com>
> Cc: staff at bioruby.org
> Subject: Re: to_re revisited
>
> Hi Trevor,
>
> Thank you for your CVS commits.
>
> On 2007/01/01, at 9:23, Trevor Wennblom wrote:
>> I noticed that 'to_re' doesn't yet match itself. For example:
>>
>> irb(main):001:0> require 'bio'
>> => true
>> irb(main):002:0> seq = Bio::Sequence::NA.new('arg')
>> => "arg"
>> irb(main):003:0> seq =~ seq.to_re
>> => nil
>> irb(main):004:0> seq.to_re
>> => /a[ag]g/
>> irb(main):005:0> seq =~ /a[rag]g/
>> => 0
>>
>> Will there be any chance that this could be fixed before the next release?
>>
>> If you're interested we exchanged emails about this issue in 2005 with the subject: "Bio::NucleicAcid na.rb to_re bug"
>
>
> I found attached my mail which couldn't be delivered to you.
>
>
> RCS file: /home/repository/bioruby/bioruby/lib/bio/data/na.rb,v
> retrieving revision 0.20
> diff -c -r0.20 na.rb
> *** na.rb 8 Feb 2006 12:15:42 -0000 0.20
> --- na.rb 1 Jan 2007 05:52:14 -0000
> ***************
> *** 158,172 ****
> end
>
> def to_re(seq, rna = false)
> ! str = seq.to_s.downcase
> ! str.gsub!(/[^atgcu]/) { |base|
> ! NAMES[base] || '.'
> ! }
> ! if rna
> ! str.tr!("t", "u")
> ! end
> ! Regexp.new(str)
> ! end
>
> end
>
> --- 158,186 ----
> end
>
> def to_re(seq, rna = false)
> ! replace = {
> ! 'y' => '[tcy]',
> ! 'r' => '[agr]',
> ! 'w' => '[atw]',
> ! 's' => '[gcw]',
> ! 'k' => '[tgk]',
> ! 'm' => '[acm]',
> ! 'b' => '[tgcyskb]',
> ! 'd' => '[atgrwkd]',
> ! 'h' => '[atcwmyh]',
> ! 'v' => '[agcmrsv]',
> ! 'n' => '[atgcyrwskmbdhvn]'
> ! }
> !
> ! str = seq.to_s.downcase
> ! str.gsub!(/[^atgcu]/) { |base|
> ! replace[base] || base
> ! }
> ! if rna
> ! str.tr!("t", "u")
> ! end
> ! Regexp.new(str)
> ! end
>
> end
>
>
> By using this, following match works.
>
>
> bioruby> seq = Bio::Sequence::NA.new('arg')
> ==> "arg"
> bioruby> seq =~ seq.to_re
> ==> 0
> bioruby> seq.to_re
> ==> /a[agr]g/
>
>
> But following still does not work as expected.
> How do you think?
>
>
> bioruby> seq = Bio::Sequence.new('arg')
> ==> #<Bio::Sequence:0x14307b4 @seq="arg">
> bioruby> seq.na
> ==> Bio::Sequence::NA
> bioruby> p seq
> #<Bio::Sequence:0x14307b4 @moltype=Bio::Sequence::NA, @seq="arg">
> ==> nil
> bioruby> seq.to_re
> ==> /a[agr]g/
> bioruby> seq =~ seq.to_re
> ==> false
> bioruby> seq.seq =~ seq.to_re
> ==> 0
> bioruby> p seq.to_re.match(seq)
> #<MatchData:0x1429d38>
> ==> nil
> bioruby> p seq.to_re.match(seq.seq)
> #<MatchData:0x1427718>
> ==> nil
>
>
> Toshiaki
>
>
>
> Begin forwarded message:
>> From: MAILER-DAEMON at bioruby.org (Mail Delivery System)
>> Date: 2005年12月11日 13:37:45:JST
>> To: k at bioruby.org
>> Subject: Undelivered Mail Returned to Sender
>>
>> This is the Postfix program at host kumamushi.bioruby.org.
>>
>> I'm sorry to have to inform you that your message could not be
>> be delivered to one or more recipients. It's attached below.
>>
>> For further assistance, please send mail to <postmaster>
>>
>> If you do so, please include this problem report. You can
>> delete your own text from the attached returned message.
>>
>> The Postfix program
>>
>> <trevor at corevx.com>: host mail.corevx.com[207.7.108.149] said: 554 Service
>> unavailable; Client host [218.223.192.34] blocked using
>> sbl-xbl.spamhaus.org; http://www.spamhaus.org/query/bl?ip=218.223.192.34
>> (in reply to RCPT TO command)
>> Reporting-MTA: dns; kumamushi.bioruby.org
>> X-Postfix-Queue-ID: 77FD72120EC
>> X-Postfix-Sender: rfc822; k at bioruby.org
>> Arrival-Date: Sun, 11 Dec 2005 13:37:33 +0900 (JST)
>>
>> Final-Recipient: rfc822; trevor at corevx.com
>> Action: failed
>> Status: 5.0.0
>> Diagnostic-Code: X-Postfix; host mail.corevx.com[207.7.108.149] said: 554
>> Service unavailable; Client host [218.223.192.34] blocked using
>> sbl-xbl.spamhaus.org; http://www.spamhaus.org/query/bl?ip=218.223.192.34
>> (in reply to RCPT TO command)
>>
>> From: Toshiaki Katayama <k at bioruby.org>
>> Date: 2005年12月11日 13:37:31:JST
>> To: staff at bioruby.org
>> Cc: Pjotr Prins <pjotr at pckassa.com>, Trevor Wennblom <trevor at corevx.com>
>> Subject: Fwd: Undelivered Mail Returned to Sender
>>
>>
>> Hi all,
>>
>> All my attempt to send my code to Trevor is failed.
>> Could anyone let him know the following URL?
>>
>> http://kumamushi.org/~k/tmp/na_to_re.txt
>>
>> Toshiaki Katayama
>>
>> Begin forwarded message:
>>
>>> From: MAILER-DAEMON at bioruby.org (Mail Delivery System)
>>> Date: 2005年12月11日 13:29:39:JST
>>> To: ktym at hgc.jp
>>> Subject: Undelivered Mail Returned to Sender
>>>
>>> This is the Postfix program at host kumamushi.bioruby.org.
>>>
>>> I'm sorry to have to inform you that your message could not be
>>> be delivered to one or more recipients. It's attached below.
>>>
>>> For further assistance, please send mail to <postmaster>
>>>
>>> If you do so, please include this problem report. You can
>>> delete your own text from the attached returned message.
>>>
>>> The Postfix program
>>>
>>> <trevor at corevx.com>: host mail.corevx.com[207.7.108.149] said: 554 Service
>>> unavailable; Client host [218.223.192.34] blocked using
>>> sbl-xbl.spamhaus.org; http://www.spamhaus.org/query/bl?ip=218.223.192.34
>>> (in reply to RCPT TO command)
>>> Reporting-MTA: dns; kumamushi.bioruby.org
>>> X-Postfix-Queue-ID: E9C0D2120BC
>>> X-Postfix-Sender: rfc822; ktym at hgc.jp
>>> Arrival-Date: Sun, 11 Dec 2005 13:29:37 +0900 (JST)
>>>
>>> Final-Recipient: rfc822; trevor at corevx.com
>>> Action: failed
>>> Status: 5.0.0
>>> Diagnostic-Code: X-Postfix; host mail.corevx.com[207.7.108.149] said: 554
>>> Service unavailable; Client host [218.223.192.34] blocked using
>>> sbl-xbl.spamhaus.org; http://www.spamhaus.org/query/bl?ip=218.223.192.34
>>> (in reply to RCPT TO command)
>>>
>>> From: Toshiaki Katayama <ktym at hgc.jp>
>>> Date: 2005年12月11日 13:29:35:JST
>>> To: Trevor Wennblom <trevor at corevx.com>
>>> Subject: Re: Bio::NucleicAcid na.rb to_re bug
>>>
>>>
>>> Hello Trevor,
>>>
>>> Ughh, how could you receive my e-mail .....?
>>>
>>> I put the massage and the code at
>>>
>>> http://kumamushi.org/~k/tmp/na_to_re.txt
>>>
>>> Toshiaki
>>>
>>> On 2005/12/11, at 13:14, Toshiaki Katayama wrote:
>>>
>>>> Trevor,
>>>>
>>>> I think to embed all possible combinations would be most efficient, after all.
>>>> Please see attached file, as your mail server reject my e-mail probably because
>>>> most of the contents was code...
>>>>
>>>> On 2005/12/11, at 3:34, Trevor Wennblom wrote:
>>>>
>>>>> Toshiaki Katayama wrote:
>>>>>
>>>>>> And the below is a ad-hoc implementation to include all possible combination of the ambiguous bases.
>>>>>> I removed brackets and sorted the bases contained in the ambiguous bases for easier implementation...
>>>>>>
>>>>>
>>>>>
>>>>> Great, I'll try to play with the ad-hoc code later this week.
>>>>>
>>>>> Thanks again for all your help Toshiaki!
>>>>> Trevor
>>>>
>>>> <na_to_re.txt.gz>
>>>
>>>
>>>
>>
>>
>>
>> On 2005/12/11, at 12:55, Toshiaki Katayama wrote:
>>> To embed all possible combinations would be most efficient, after all...
>>>
>>> def to_re(seq, rna = false)
>>> replace = {
>>> 'y' => '[tcy]',
>>> 'r' => '[agr]',
>>> 'w' => '[atw]',
>>> 's' => '[gcw]',
>>> 'k' => '[tgk]',
>>> 'm' => '[acm]',
>>> 'b' => '[tgcyskb]',
>>> 'd' => '[atgrwkd]',
>>> 'h' => '[atcwmyh]',
>>> 'v' => '[agcmrsv]',
>>> 'n' => '[atgcyrwskmbdhvn]'
>>> }
>>>
>>> str = seq.to_s
>>> str.gsub!(/[^atgcu]/) { |base|
>>> replace[base] || base
>>> }
>>> if rna
>>> str.tr!("t", "u")
>>> end
>>> Regexp.new(str)
>>> end
>>>
>>> On 2005/12/11, at 3:34, Trevor Wennblom wrote:
>>>
>>>> Toshiaki Katayama wrote:
>>>>
>>>>> And the below is a ad-hoc implementation to include all possible combination of the ambiguous bases.
>>>>> I removed brackets and sorted the bases contained in the ambiguous bases for easier implementation...
>>>>>
>>>>
>>>>
>>>> Great, I'll try to play with the ad-hoc code later this week.
>>>>
>>>> Thanks again for all your help Toshiaki!
>>>> Trevor
>>>
>>
>>
>>
>> On 2005/12/11, at 3:26, Toshiaki Katayama wrote:
>>> Trevor,
>>>
>>> I have commited the following code for the meanwhile.
>>>
>>> def to_re(seq, rna = false)
>>> str = seq.to_s
>>> str.gsub!(/[^atgcu]/) { |base|
>>> NAMES[base] || '.'
>>> }
>>> if rna
>>> str.tr!("t", "u")
>>> end
>>> Regexp.new(str)
>>> end
>>>
>>> And the below is a ad-hoc implementation to include all possible combination of the ambiguous bases.
>>> I removed brackets and sorted the bases contained in the ambiguous bases for easier implementation...
>>>
>>> NAMES = {
>>> 'y' => 'ct',
>>> 'r' => 'ag',
>>> 'w' => 'at',
>>> 's' => 'cg',
>>> 'k' => 'gt',
>>> 'm' => 'ac',
>>>
>>> 'b' => 'cgt',
>>> 'd' => 'agt',
>>> 'h' => 'act',
>>> 'v' => 'acg',
>>>
>>> 'n' => 'acgt',
>>>
>>> 'a' => 'a',
>>> 't' => 't',
>>> 'g' => 'g',
>>> 'c' => 'c',
>>> 'u' => 'u',
>>>
>>> 'A' => 'Adenine',
>>> 'T' => 'Thymine',
>>> 'G' => 'Guanine',
>>> 'C' => 'Cytosine',
>>> 'U' => 'Uracil',
>>>
>>> 'Y' => 'pYrimidine',
>>> 'R' => 'puRine',
>>> 'W' => 'Weak',
>>> 'S' => 'Strong',
>>> 'K' => 'Keto',
>>> 'M' => 'aroMatic',
>>>
>>> 'B' => 'not A',
>>> 'D' => 'not C',
>>> 'H' => 'not G',
>>> 'V' => 'not T',
>>> }
>>>
>>> def to_re(seq, rna = false)
>>> str = seq.to_s
>>> str.gsub!(/[^atgcu]/) { |base|
>>> set = NAMES[base].dup || '.'
>>> case set.size
>>> when 1
>>> # just pass the base
>>> when 2
>>> # add ambiguous base
>>> set += base
>>> when 3
>>> # add any combination of the ambiguous bases
>>> set = combination_3c2(set)
>>> set += base
>>> when 4
>>> # use the knowledge (or implement recursive method)
>>> set = "atgcyrwskmbdhv"
>>> end
>>> set
>>> }
>>> if rna
>>> str.tr!("t", "u")
>>> end
>>> Regexp.new(str)
>>> end
>>>
>>> private
>>>
>>> def combination_3c2(str)
>>> result = ''
>>> rev = NAMES.invert
>>> ary = str.split(//)
>>> [ ary[0] + ary[1], ary[0] + ary[2], ary[1] + ary[2] ].each do |key|
>>> result << rev[key]
>>> end
>>> result + str
>>> end
>>>
>>> Toshiaki
>>>
>>> On 2005/12/11, at 2:37, Trevor Wennblom wrote:
>>>
>>>> Toshiaki Katayama wrote:
>>>>
>>>>> Trevor,
>>>>>
>>>>> In this sense, we need to include permutation of ambiguous bases recursively.
>>>>> e.g. b => [tgc] not just match /[btgc]/ but also /[bkystgc]/ right?
>>>>>
>>>>
>>>>
>>>> Yes, you're correct. This is a bigger algorithm issue than I first realized.
>>>
>>
>>
>>
>>
>> On 2005/12/11, at 2:33, Toshiaki Katayama wrote:
>>> Trevor,
>>>
>>> In this sense, we need to include permutation of ambiguous bases recursively.
>>> e.g. b => [tgc] not just match /[btgc]/ but also /[bkystgc]/ right?
>>>
>>> Toshiaki
>>>
>>> On 2005/12/10, at 3:58, Trevor Wennblom wrote:
>>>
>>>> Trevor Wennblom wrote:
>>>>> For the purposes of the Regexp (untested), would you be interested in adding the following to NAMES:
>>>>>
>>>>> 'n' => '.',
>>>>> 'x' => '.?',
>>>>>
>>>>> 'N' => 'aNy',
>>>>> 'X' => 'any or deletion',
>>>>
>>>> Just to follow-up -
>>>> I see that NAMES presently has:
>>>> 'n' => '[atgc]',
>>>>
>>>> Which of course wont match 's', 'y', 'w', etc. I'm trying to think of a case where 'n'=>'.' or 'x'=>'.?' may be problematic but I haven't been able to think of a problem with it yet. Perhaps it would fail in situations where the user may have provided non-nucleotide characters .
>>>>
>>>> I suppose if we wanted to be more explicit we could do:
>>>> 'n' => '[atgcyrwskmbdhv]',
>>>> 'x' => '[atgcyrwskmbdhv]?',
>>>>
>>>> 'N' => 'aNy',
>>>> 'X' => 'any or deletion',
>>>>
>>>> I can come up with some tests for Regexp objects where 'x' may be involved to demonstrate it's utility.
>>>>
>>>> Thanks,
>>>> Trevor
>>>
>>
>>
>>
>>
>> On 2005/12/10, at 3:46, Trevor Wennblom wrote:
>>> Toshiaki Katayama wrote:
>>>> def to_re(seq, rna = false)
>>>> str = seq.to_s
>>>> str.gsub!(/./) { |base|
>>>> re = NAMES[base].dup || '.'
>>>> if re[0] == ?[
>>>> re[1,0] = base
>>>> end
>>>> re
>>>> }
>>>> if rna
>>>> str.tr!("t", "u")
>>>> end
>>>> Regexp.new(str)
>>>> end
>>>>
>>> ...
>>>> str.gsub!(/[^atgcu]/) { |base|
>>>>
>>>> would be much better!
>>>
>>> I haven't had a chance to test it, but just from looking at it seems very impressive Toshiaki! I think it's good to go.
>>>
>>> For the purposes of the Regexp (untested), would you be interested in adding the following to NAMES:
>>>
>>> 'n' => '.',
>>> 'x' => '.?',
>>>
>>> 'N' => 'aNy',
>>> 'X' => 'any or deletion',
>>>
>>>
>>> Again, excellent work!
>>> Trevor
>>
>>
>>
>
>
>
>
More information about the BioRuby
mailing list