[Biopython] correction and follow up to previous question

Tue Jun 7 15:25:55 UTC 2011

On Tue, Jun 7, 2011 at 3:55 PM, George Devaniranjan
<devaniranjan at gmail.com> wrote:

> Sorry guys--It seems to work when I define the sequence as a LIST
>
> however I have another doubt......
>
> the top is the original sequence the bottom the shuffled sequence--while
> some residues are shuffled, it's not "very" shuffled
> is this "normal" ?

With random.shuffle, every permutation of the input is supposed to be
equally probable. This means a result that's similar to the input is
possible, as is one that's very different.

>>> import random
>>> stuff = list('ASDFSDFG')
>>> random.shuffle(stuff)
>>> ''.join(stuff)
'GFSFADDS'
>>> random.shuffle(stuff)
>>> ''.join(stuff)
'FADDSGSF'

In theory, anything is possible.
http://dilbert.com/strips/comic/2001-10-25/
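
One way to put a number on "how shuffled" a result is would be to count
the positions where the shuffled residue happens to equal the original
one, over many shuffles. A minimal sketch; the helper name
fraction_unchanged, the number of trials, and the fixed seed are my own
choices for illustration, not anything from Biopython:

```python
import random

def fraction_unchanged(original, shuffled):
    """Fraction of positions holding the same residue before and after."""
    matches = sum(a == b for a, b in zip(original, shuffled))
    return matches / len(original)

rng = random.Random(42)  # fixed seed only so the run is repeatable
original = "ASDFSDFG"    # the example sequence used above

fractions = []
for _ in range(1000):    # 1000 trials is an arbitrary choice
    residues = list(original)
    rng.shuffle(residues)
    fractions.append(fraction_unchanged(original, "".join(residues)))

mean_fraction = sum(fractions) / len(fractions)
print("mean fraction unchanged:", mean_fraction)
print("min:", min(fractions), "max:", max(fractions))
```

For this particular sequence, a uniform shuffle leaves on average 14/64
(about 22%) of positions unchanged, simply because S, D and F each occur
twice. So a result that still looks partly like the input is expected.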

On Tue, Jun 7, 2011 at 11:01 AM, João Rodrigues <anaryin at gmail.com> wrote:

> Hey George,
>
> From the Python Docs:
>
> random.shuffle(*x*[, *random*])
> >
> > Shuffle the sequence *x* in place. The optional argument *random* is a
> > 0-argument function returning a random float in [0.0, 1.0); by default,
> > this is the function random()
> > <http://docs.python.org/library/random.html#random.random>.
> >
> > Note that for even rather small len(x), the total number of permutations
> > of *x* is larger than the period of most random number generators; this
> > implies that most permutations of a long sequence can never be generated.
> >
> This might be the answer to your last question. A more efficient
> combination perhaps would be to use random.choice and then append to a
> list. Perhaps this leads to better randomized sequences, but I'm talking
> out of thin air, not based on experience.
>
>
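For what it's worth, building a new sequence with random.choice is a
different operation from shuffling: each draw is independent (sampling
with replacement), so the residue composition is generally not
preserved, whereas random.shuffle only reorders what is there. A minimal
sketch, with the example sequence and the seed chosen arbitrarily:

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed only so the run is repeatable
original = "ASDFSDFG"

# Shuffle: reorders the existing residues, so composition is unchanged.
shuffled = list(original)
rng.shuffle(shuffled)

# Choice + append: every draw picks from the full sequence again.
resampled = []
for _ in range(len(original)):
    resampled.append(rng.choice(original))

# Compare residue counts of each result against the original.
print("".join(shuffled), Counter(shuffled) == Counter(original))
print("".join(resampled), Counter(resampled) == Counter(original))
```

Which of the two you want depends on whether the randomized sequence has
to keep the exact composition of the original.
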
According to the docs, the pseudo-RNG implementation (the Mersenne
Twister) has a period of 2**19937-1. If I'm understanding random.shuffle
correctly, a list of length k has k! permutations. So:

>>> import math
>>> 2**19937-1 < math.factorial(2081)
True
>>> 2**19937-1 < math.factorial(2080)
False

It should work as expected for lists of up to 2080 elements. Beyond that,
there are more permutations than the generator's period can cover, so some
orderings can never be produced and the shuffle gradually becomes less
purely "random" (but still behaves fairly well for most use cases in
biology). So random.shuffle is an acceptable choice for protein sequences,
but not for whole genomes. But for whole genomes you'd probably want to use
a more clever HMM-based model for generating random sequences, anyway.
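
If you'd rather compute the 2080 cutoff than find it by trial and error,
something like the following should do it. This is just a sketch; the
names MT_PERIOD and max_fully_shufflable_length are made up for the
example:

```python
import math

MT_PERIOD = 2**19937 - 1  # period quoted in the random module docs

def max_fully_shufflable_length(period):
    """Largest n such that n! does not exceed the given period."""
    n = 1
    permutations = 1  # running value of n!
    while permutations * (n + 1) <= period:
        n += 1
        permutations *= n
    return n

limit = max_fully_shufflable_length(MT_PERIOD)
print(limit)  # prints 2080, matching the comparison above

# Cross-check against math.factorial
assert math.factorial(limit) <= MT_PERIOD < math.factorial(limit + 1)
```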

Cheers,
Eric