[BioRuby] Biased Bio::Sequence randomize()
Toshiaki Katayama
ktym at hgc.jp
Mon Oct 13 20:55:41 UTC 2008
Dear Jacobsen,
> I believe that the current sequence randomization/shuffle method is severely
> biased, infrequent bases are more likely to occur in the end of the sequence
> than in the beginning:
You are right.
I had fixed a while ago, but it seems that I forgot to commit to the repository, sorry.
Could you try the following replacement?
def randomize(hash = nil)
if hash
tmp = ''
hash.each {|k, v|
tmp += k * v.to_i
}
else
tmp = self
end
len = tmp.length - 1
seq = ''
ary = (0..len).to_a.sort_by{rand}
ary.each do |idx|
if block_given?
yield tmp[idx]
else
seq << tmp[idx]
end
end
return self.class.new(seq)
end
I haven't yet prepared my git environment, so I hope someone (Goto-san?) can
take care of the following patch.
Regards,
Toshiaki Katayama
% cvs diff -u lib/bio/sequence/common.rb
===========================================
dev.open-bio.org - Authorized Access Only
===========================================
Index: lib/bio/sequence/common.rb
===================================================================
RCS file: /home/repository/bioruby/bioruby/lib/bio/sequence/common.rb,v
retrieving revision 1.6
diff -u -r1.6 common.rb
--- lib/bio/sequence/common.rb 27 Dec 2007 17:36:02 -0000 1.6
+++ lib/bio/sequence/common.rb 13 Oct 2008 20:50:27 -0000
@@ -241,28 +241,22 @@
# * (optional) _hash_: Hash object
# *Returns*:: new Bio::Sequence::NA/AA object
def randomize(hash = nil)
- length = self.length
if hash
- length = 0
- count = hash.clone
- count.each_value {|x| length += x}
+ tmp = ''
+ hash.each {|k, v|
+ tmp += k * v.to_i
+ }
else
- count = self.composition
+ tmp = self
end
-
+ len = tmp.length - 1
seq = ''
- tmp = {}
- length.times do
- count.each do |k, v|
- tmp[k] = v * rand
- end
- max = tmp.max {|a, b| a[1] <=> b[1]}
- count[max.first] -= 1
-
+ ary = (0..len).to_a.sort_by{rand}
+ ary.each do |idx|
if block_given?
- yield max.first
+ yield tmp[idx]
else
- seq += max.first
+ seq << tmp[idx]
end
end
return self.class.new(seq)
On 2008/10/14, at 4:25, Anders Jacobsen wrote:
> Hi,
>
>
>
> I believe that the current sequence randomization/shuffle method is severely
> biased, infrequent bases are more likely to occur in the end of the sequence
> than in the beginning:
>
>
>
> class Array
>
> #returns a histogram represented as a hash
>
> def hist()
>
> h = Hash.new(0)
>
> self.each{|x| h[x] += 1}
>
> h
>
> end
>
> end
>
>
>
>>> (1..1000).to_a.map{|i|
> Bio::Sequence::NA.new("ccccggac").randomize.index("a") + 1}.hist.sort
>
> => [[1, 36], [2, 51], [3, 62], [4, 97], [5, 127], [6, 189], [7, 219], [8,
> 219]]
>
>
>
> I suggest implementing this method using the unbiased Fisher-Yates shuffle
> (http://en.wikipedia.org/wiki/Fisher-Yates_shuffle)
>
>
>
> class Array
>
> def shuffle()
>
> arr = self.dup
>
> arr.size.downto 2 do |j|
>
> r = Kernel::rand(j)
>
> arr[j-1], arr[r] = arr[r], arr[j-1]
>
> end
>
> arr
>
> end
>
> end
>
>
>
> (1..1000).to_a.map{|i|
> Bio::Sequence::NA.new("ccccggac").split("").shuffle.index("a") +
> 1}.hist.sort
>
> => [[1, 121], [2, 127], [3, 135], [4, 119], [5, 145], [6, 104], [7, 126],
> [8, 123]]
>
>
>
> -Anders Jacobsen
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
More information about the BioRuby
mailing list