[BioRuby] Biased Bio::Sequence randomize()
Anders Jacobsen
andersbj at binf.ku.dk
Mon Oct 13 19:25:16 UTC 2008
Hi,
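I believe the current sequence randomization/shuffle method is severely
biased: infrequent bases are more likely to end up near the end of the
sequence than near the beginning. For a sequence with a single "a", each of
the 8 positions should come up about 125 times in 1000 runs, which is not
what I see: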
class Array
  # Returns a histogram represented as a hash (element => count).
  def hist()
    h = Hash.new(0)
    self.each{|x| h[x] += 1}
    h
  end
end
>> (1..1000).to_a.map{|i| Bio::Sequence::NA.new("ccccggac").randomize.index("a") + 1}.hist.sort
=> [[1, 36], [2, 51], [3, 62], [4, 97], [5, 127], [6, 189], [7, 219], [8, 219]]
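One way to put a number on the skew is a chi-square goodness-of-fit test against the uniform expectation. The sketch below is only an illustration: the helper name chi_square_uniform is made up for this example, the counts are copied from the output above, and 14.07 is the standard table value for 7 degrees of freedom at the 5% level.

```ruby
# Positions of "a" over 1000 calls to randomize (counts copied from the output above).
observed = [36, 51, 62, 97, 127, 189, 219, 219]

# Chi-square statistic against a uniform expectation (hypothetical helper).
# Under an unbiased shuffle every position is equally likely.
def chi_square_uniform(counts)
  total = counts.inject(0) { |sum, c| sum + c }
  expected = total.to_f / counts.size
  counts.inject(0.0) { |sum, c| sum + (c - expected)**2 / expected }
end

# Standard chi-square table value: 7 degrees of freedom, 5% significance level.
CRITICAL_DF7_P05 = 14.07

stat = chi_square_uniform(observed)
puts "chi-square = %.2f (critical value %.2f)" % [stat, CRITICAL_DF7_P05]
```

For these counts the statistic comes out at roughly 319, far above the critical value, so the deviation from uniform is very unlikely to be sampling noise.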
I suggest implementing this method using the unbiased Fisher-Yates shuffle
(http://en.wikipedia.org/wiki/Fisher-Yates_shuffle) instead:
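class Array
  # Returns a shuffled copy; the receiver is left unchanged.
  def shuffle()
    arr = self.dup
    arr.size.downto 2 do |j|
      # pick a random index in 0..j-1 and swap it into slot j-1
      r = Kernel::rand(j)
      arr[j-1], arr[r] = arr[r], arr[j-1]
    end
    arr
  end
end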
>> (1..1000).to_a.map{|i| Bio::Sequence::NA.new("ccccggac").split("").shuffle.index("a") + 1}.hist.sort
=> [[1, 121], [2, 127], [3, 135], [4, 119], [5, 145], [6, 104], [7, 126], [8, 123]]
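For anyone who wants to repeat the check without BioRuby loaded, here is a minimal standalone sketch of the same idea on a plain string. The names fisher_yates_shuffle and position_histogram are hypothetical, chosen so nothing clashes with a built-in Array#shuffle, and passing in a seeded Random (available in Ruby 1.9.2 and later) is an assumption made only so that runs are repeatable.

```ruby
# Fisher-Yates shuffle on a copy of the input array (hypothetical helper name).
def fisher_yates_shuffle(items, rng = Random.new)
  arr = items.dup
  (arr.size - 1).downto(1) do |i|
    j = rng.rand(i + 1)               # uniform integer in 0..i
    arr[i], arr[j] = arr[j], arr[i]   # swap slot i with the chosen slot
  end
  arr
end

# Histogram of the 1-based position of `target` over `trials` shuffles,
# returned as sorted [position, count] pairs.
def position_histogram(seq, target, trials, rng = Random.new)
  counts = Hash.new(0)
  trials.times do
    shuffled = fisher_yates_shuffle(seq.chars.to_a, rng)
    counts[shuffled.index(target) + 1] += 1
  end
  counts.sort
end

# Seeded generator so the run can be repeated exactly.
p position_histogram("ccccggac", "a", 1000, Random.new(42))
```

Each of the 8 positions should land near 125 hits, with the usual sampling scatter of about 10 either way.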
-Anders Jacobsen