[Biopython] weighted sampling of a dictionary

Thu Oct 27 20:52:00 UTC 2011

Hi George,

I was actually doing this yesterday :)

The function I came up with takes two lists:

import random

def weighted_sample(population, weights):
    """Sample one item from population, with probability proportional to its weight."""
    if len(population) != len(weights):
        raise ValueError('Lengths of population and weights do not match')
    total = float(sum(weights))  # sum once, rather than once per weight
    val = random.random() * total
    running_total = 0.0
    for item, weight in zip(population, weights):
        running_total += weight
        if val < running_total:
            return item
    return population[-1]  # guard against floating-point rounding at the top end

Which seems to do the trick:

population = ['AAU', 'AAC', 'AAG']
weights = [2, 5, 3]
sample = [weighted_sample(population, weights) for _ in range(1000)]
sample.count('AAC')  # should be about 500
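
For a dictionary with around 1400 keys, walking the whole list on every draw gets expensive. Here is a rough sketch of one alternative: build the running totals once and binary-search them with bisect. The name make_weighted_sampler and the three-key triplet_counts subset are made up for illustration, and the sketch assumes Python 3 and positive integer counts.

```python
import bisect
import itertools
import random

def make_weighted_sampler(freqs):
    """Return a function that draws one key in proportion to its integer count."""
    keys = list(freqs)
    # Running totals, e.g. counts 188, 177, 1 -> [188, 365, 366]
    cumulative = list(itertools.accumulate(freqs[k] for k in keys))
    total = cumulative[-1]

    def draw():
        # Pick an integer in [0, total) and find the first running total above it.
        point = random.randrange(total)
        return keys[bisect.bisect_right(cumulative, point)]

    return draw

# Small illustrative subset of the triplet dictionary
triplet_counts = {'AGD': 188, 'AGS': 177, 'LAU': 1}
draw = make_weighted_sampler(triplet_counts)
sample = [draw() for _ in range(1000)]
```

Each draw is then a binary search over the running totals instead of a full pass over the weights.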

If that's too slow, check out numpy's random.multinomial() function.

I haven't tested this, but it should give you the number of times each
codon comes up in 1000 "draws":

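import numpy as np

# codon_dict maps each codon to its weight; split it into two parallel tuples
codons, weights = zip(*codon_dict.items())
denom = float(sum(weights))
normalised_weights = [w / denom for w in weights]
# multinomial(n, pvals) returns one count per codon, in the same order as codons
np.random.multinomial(1000, normalised_weights)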
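
multinomial returns a bare array of counts, in the same order as the probabilities it was given, so the counts need pairing back up with their keys to be readable. A minimal sketch of that step, assuming a small made-up codon_dict (the names probs and draws_per_codon are only for illustration):

```python
import numpy as np

# Hypothetical example data: codon -> weight
codon_dict = {'AAU': 2, 'AAC': 5, 'AAG': 3}

codons = list(codon_dict)
weights = np.array([codon_dict[c] for c in codons], dtype=float)
probs = weights / weights.sum()  # probabilities must sum to 1

# One count per codon over 1000 draws, in the same order as codons
counts = np.random.multinomial(1000, probs)
draws_per_codon = dict(zip(codons, counts.tolist()))
```

Here draws_per_codon maps each codon to the number of times it was drawn, and the values add up to 1000.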

Cheers,
David

Quoting George Devaniranjan <devaniranjan at gmail.com>:

> Hi,
>
> I am not sure if this question is more suitable for biopython or a python
> forum.
>
>
> I have the following dictionary.
>
> dict = {'YLE': 6, 'QYL': 36, 'PTD': 32, 'AGG': 145, 'QYG': 34, 'QYD': 34,
> 'AGD': 188, 'QYS': 35, 'AGS': 177, 'AGA': 154, 'QYA': 23, 'AGL': 16,
> 'LAU': 1, 'PTA': 7, 'AGY': 7, 'QYY': 19, 'QYE': 6, 'PAT': 57, 'QYT': 28,
> 'AGT': 10, 'QYQ': 34, 'AGQ': 140, 'QYP': 32, 'AGP': 167, 'TAT': 31,
> 'SGS': 174, 'TAP': 18, 'YLP': 49, 'TAQ': 23, 'UQE': 5, 'UAQ': 9, 'UAT': 8,
> 'UAE': 7, 'TAD': 1, 'TAG': 15, 'TAA': 20, 'TAS': 1, 'YUP': 1, 'TAL': 45,
> 'ALU': 20, 'PEP': 14, 'UAG': 6, 'EAL': 16, 'SYY': 36, 'EAS': 35, 'SYT': 29,
> 'EAA': 16, 'SYQ': 13, 'EAG': 28}
>
> The keys are the different amino acid triplets (all possible triplets
> extracted from a culled list of PDB), the numbers next to them are the
> frequencies at which they occur.
>
> I was wondering if there is a way in biopython/python to sample them at the
> frequency indicated by the numbers next to the keys.
>
> I have only given a snippet of the triplet dictionary, the entire dictionary
> has about 1400 key entries.
>
> I would appreciate any help in this matter --thank you very much.
>
> George