[Biopython] weighted sampling of a dictionary
David Winter
winda002 at student.otago.ac.nz
Thu Oct 27 20:52:00 UTC 2011
Hi George,
I was actually doing this yesterday :)
The function I came up with takes two lists:
import random

def weighted_sample(population, weights):
    """ Sample from a population, given provided weights """
    if len(population) != len(weights):
        raise ValueError('Lengths of population and weights do not match')
    # Sum once up front rather than once per weight
    total = float(sum(weights))
    normal_weights = [w / total for w in weights]
    val = random.random()
    running_total = 0
    for index, weight in enumerate(normal_weights):
        running_total += weight
        if val < running_total:
            return population[index]
    # Rounding can leave running_total just under 1.0; fall back to the last item
    return population[-1]
Which seems to do the trick:
population = ['AAU', 'AAC', 'AAG']
weights = [2, 5, 3]
sample = [weighted_sample(population, weights) for _ in range(1000)]
sample.count('AAC')  # should be about 500
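Since your data is a dictionary rather than two lists, here is a rough sketch of the same idea that takes the dictionary directly. It assumes the values are positive integer counts (as in your example), builds the running totals once, and then uses the standard library's bisect module to look up each draw. The names make_weighted_sampler and triplets are just placeholders I made up for illustration:

```python
import bisect
import itertools
import random

def make_weighted_sampler(freqs):
    """Return a function that draws one key from freqs, weighted by its count.

    Assumes every value in freqs is a non-negative integer and that at
    least one of them is greater than zero.
    """
    keys = list(freqs)
    # Running totals of the counts, built once and reused for every draw
    cumulative = list(itertools.accumulate(freqs[k] for k in keys))
    total = cumulative[-1]

    def draw():
        # Pick an integer in [0, total) and find the first running total above it
        point = random.randrange(total)
        return keys[bisect.bisect_right(cumulative, point)]

    return draw

# A few entries from the dictionary in the question, for illustration only
triplets = {'YLE': 6, 'QYL': 36, 'PTD': 32}
draw = make_weighted_sampler(triplets)
sample = [draw() for _ in range(1000)]
```

Each draw is then a binary search over the running totals instead of a walk through every weight, which should matter more with ~1400 keys than with three.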
If that's too slow, check out numpy's random.multinomial() function.
I haven't tested this, but this should get you the number of times you
get each codon from 1000 "draws":
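import numpy as np
# Split the dictionary into parallel tuples of keys and weights
codons, weights = zip(*codon_dict.items())
denom = sum(weights)
normalised_weights = [float(w)/denom for w in weights]
# multinomial(n, pvals): counts[i] is how many of the 1000 draws gave codons[i]
counts = np.random.multinomial(1000, normalised_weights)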
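As a self-contained sketch of the numpy route, with a small made-up codon_dict standing in for the real one (the dictionary contents and the variable names here are my assumptions, not anything from your data):

```python
import numpy as np

# Hypothetical stand-in for the real codon -> weight dictionary
codon_dict = {'AAU': 2, 'AAC': 5, 'AAG': 3}

codons = list(codon_dict)
weights = np.array([codon_dict[c] for c in codons], dtype=float)
probs = weights / weights.sum()

# One multinomial experiment of 1000 draws:
# counts[i] is how often codons[i] came up
counts = np.random.multinomial(1000, probs)
counts_by_codon = dict(zip(codons, counts.tolist()))

# If the individual draws are wanted rather than the totals,
# np.random.choice accepts the same probabilities
draws = np.random.choice(codons, size=10, p=probs)
```

The multinomial call only gives you the totals per codon; the np.random.choice line is there in case you need the actual sequence of sampled codons.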
Cheers,
David
Quoting George Devaniranjan <devaniranjan at gmail.com>:
> Hi,
>
> I am not sure if this question is more suitable for biopython or a python
> forum.
>
>
> I have the following dictionary.
>
> dict = {'YLE': 6, 'QYL': 36, 'PTD': 32, 'AGG': 145, 'QYG': 34, 'QYD': 34,
> 'AGD': 188, 'QYS': 35, 'AGS': 177, 'AGA': 154, 'QYA': 23, 'AGL': 16,
> 'LAU': 1, 'PTA': 7, 'AGY': 7, 'QYY': 19, 'QYE': 6, 'PAT': 57, 'QYT': 28,
> 'AGT': 10, 'QYQ': 34, 'AGQ': 140, 'QYP': 32, 'AGP': 167, 'TAT': 31,
> 'SGS': 174, 'TAP': 18, 'YLP': 49, 'TAQ': 23, 'UQE': 5, 'UAQ': 9, 'UAT': 8,
> 'UAE': 7, 'TAD': 1, 'TAG': 15, 'TAA': 20, 'TAS': 1, 'YUP': 1, 'TAL': 45,
> 'ALU': 20, 'PEP': 14, 'UAG': 6, 'EAL': 16, 'SYY': 36, 'EAS': 35, 'SYT': 29,
> 'EAA': 16, 'SYQ': 13, 'EAG': 28}
>
> The keys are the different amino acid triplets (all possible triplets
> extracted from a culled list of PDB), the numbers next to them are the
> frequency with which they occur.
>
> I was wondering if there is a way in biopython/python to sample them at the
> frequency indicated by the numbers next to the keys.
>
> I have only given a snippet of the triplet dictionary, the entire dictionary
> has about 1400 key entries.
>
> I would appreciate any help in this matter --thank you very much.
>
> George
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>