[BioPython] [PopGen] a random Haplotype Sets generator

Fri Nov 14 06:21:02 EST 2008

On Thu, Nov 13, 2008 at 7:57 PM, Bruce Southey <bsouthey at gmail.com> wrote:
>
> Oh, I am on the right list? It does say Biopython... :-)

I added a [PopGen] tag to the subject of the mail to indicate that it
is related to the PopGen module and its development.

>> This is a module to generate test sets to help the development of the
>> other future PopGen modules.
>>
>
> Great!
>
>> For example, we wanted to write a function to calculate the Fst
>> statistic over SNP data.
>> The Fst is an index that tells you if, given two populations, they
>> follow the same pattern of variability, and therefore can be
>> considered as two subpopulations of the same population or not.
>> To test such a script, you will need a module like the one I wrote
>> here: for example, you could create two samples of 200 individuals
>> with the same frequencies at every site, and see what your Fst script
>> tells. Then, probably, compare the results with another tool that is
>> already known to calculate the Fst correctly.
>>
>> So I was just asking for any suggestions - which models should I
>> implement in this generator? And how? Which parameters should it
>> accept? Should it use the random module?
>>
>>
>
> The importance is more the API than the actual implementation - as the later
> posts by Tiago indicate.
>
> Some coding related comments:
> freqs_per_site and alleles_per_site are lists.
> This is a problem because these could get very large, it is inflexible and
> you could become out of sync.

They are not required to be lists.
freqs_per_site and alleles_per_site can be any kind of Python object
with __getitem__ and __len__ methods.

What I would like to do now is to create two classes, 'Freqs' and
'Alleles', with such methods, so I can use them as containers for this
information without having to change the actual interface.

The __getitem__ method could return a background value (0.5) for any
position except those given a different value at initialization.
This would also save memory.
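
A minimal sketch of such a container, to make the idea concrete. The
class name SparseFreqs and its constructor arguments are hypothetical
and are not taken from the current HaplotypesGenerator code:

```python
class SparseFreqs:
    """Hypothetical per-site frequency container with a background value."""

    def __init__(self, length, overrides=None, background=0.5):
        self._length = length
        self._background = background
        # Only the sites that differ from the background are stored.
        self._overrides = dict(overrides or {})
        for pos in self._overrides:
            if not 0 <= pos < length:
                raise IndexError("site %d is outside 0..%d" % (pos, length - 1))

    def __len__(self):
        return self._length

    def __getitem__(self, pos):
        # Integer positions only; slices and negative indices are not handled.
        if not 0 <= pos < self._length:
            raise IndexError("site index out of range")
        return self._overrides.get(pos, self._background)


# One million sites, but only two values are actually kept in memory.
freqs = SparseFreqs(1000000, {10: 0.9, 500: 0.05})
```

Any code that only calls len() and indexes by position would accept
this object in place of a list.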

Have a look at the new changes:
- http://tinyurl.com/64tfef

(http://github.com/dalloliogm/biopython---popgen/tree/master/src/PopGen/Gio/HaplotypesGenerator.py)

> While you do check for length, you should be more informative of which has a
> different length.
> Also you need to check for valid inputs (frequencies between 0 and 1, bases
> in ACGT).

ok
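
Something along these lines could do the checks. This is only a
sketch: the function name check_params is hypothetical, and it assumes
that each entry of alleles_per_site is a pair of bases and that each
frequency refers to the first allele of the pair:

```python
def check_params(freqs_per_site, alleles_per_site, valid_bases="ACGT"):
    """Raise ValueError with an informative message if the inputs are invalid."""
    valid = frozenset(valid_bases)
    if len(freqs_per_site) != len(alleles_per_site):
        # Report both lengths so the user can see which one is wrong.
        raise ValueError(
            "freqs_per_site has %d sites but alleles_per_site has %d"
            % (len(freqs_per_site), len(alleles_per_site)))
    for pos in range(len(freqs_per_site)):
        freq = freqs_per_site[pos]
        if not 0.0 <= freq <= 1.0:
            raise ValueError(
                "site %d: frequency %r is not between 0 and 1" % (pos, freq))
        for base in alleles_per_site[pos]:
            if base not in valid:
                raise ValueError(
                    "site %d: allele %r is not one of %s"
                    % (pos, base, "".join(sorted(valid))))
```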

> Some other comments
>
> Perhaps I misunderstood the situation but the major problem that I have is
> that the locations are treated as independent so your model assumes unlinked
> loci. I just don't find this a useful scenario.

This depends on which parameters you pass to the HaplotypesGenerator
init function.
I would prefer to create a basic module that generates sequences given
the frequencies and alleles in every position, and other functions to
create its parameters.
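
To show what I mean by a basic module, here is a rough sketch. It is
not the code in the repository: the function name sample_haplotypes is
hypothetical, each site is assumed to have exactly two alleles, and
sites are drawn independently (so this is exactly the unlinked-loci
case):

```python
import random


def sample_haplotypes(freqs_per_site, alleles_per_site, n_haplotypes, rng=None):
    """Draw haplotypes site by site, each site independently of the others.

    freqs_per_site[pos] is taken to be the frequency of the first allele
    in alleles_per_site[pos].
    """
    rng = rng or random.Random()
    haplotypes = []
    for _ in range(n_haplotypes):
        sites = []
        for pos in range(len(freqs_per_site)):
            first, second = alleles_per_site[pos]
            # rng.random() is in [0, 1), so a frequency of 1.0 always
            # gives the first allele and 0.0 always gives the second.
            if rng.random() < freqs_per_site[pos]:
                sites.append(first)
            else:
                sites.append(second)
        haplotypes.append("".join(sites))
    return haplotypes
```

Passing a seeded random.Random instance as rng makes a test set
reproducible.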

I forgot to say it in the first mail, but if you want to use more
sophisticated scenarios - like populations that have suffered a
bottleneck or have a particular history - there are already better
tools available to do that; we should think about how to integrate this
module with them.
Maybe I should rename this module as 'SimpleHaplotypesSampler'.

> You assume that the user knows exactly which locations and frequency to
> change. Often you just want a random frequency and random location. In that
> case you need to randomly select locations and frequencies based on some
> function. But I do not find the mode=='random' of paramsGenerator sufficient
> to address this. Further, you might want a random sequence of some length
> but you do not want all locations to change.

ok, but consider that these are haplotypes and not sequences, so you
most likely need to have regions that are more conserved and others
that change more.
Which models to implement is a good question, but I first need to
find a better way to represent frequencies; then I can think about
the models.
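
As a starting point for discussion only, one very simple model would
pick a few variable sites at random and draw their frequencies from a
restricted range. The function name random_sparse_freqs and its
parameters are hypothetical, and the uniform distribution is just an
assumption, not a proposal for the final model:

```python
import random


def random_sparse_freqs(n_sites, n_variable, low=0.0, high=1.0, rng=None):
    """Return {position: frequency} for n_variable randomly chosen sites.

    Sites that are not in the returned dict are meant to keep whatever
    background frequency the caller decides on.
    """
    rng = rng or random.Random()
    # Sample positions without replacement, so no site is chosen twice.
    positions = rng.sample(range(n_sites), n_variable)
    # Frequencies are drawn uniformly between low and high.
    return {pos: rng.uniform(low, high) for pos in positions}
```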

> While you could set those
> locations to zero, a more sparse form would be desirable.

I think the idea of a Freqs_per_site object should fix this.

> Also, the randomly
> generated frequencies should have a way to be limited in other ranges than
> the [0 to 1) of random.random. Obviously the question is whether or not the
> user has to do it themselves.

> One particular use of generating SNPs pertains to known genes or sequences.
> In such cases it would be great to use a known sequence as a base for the
> simulation.

> Further, it would be very useful to be able to incorporate known SNP
> data, especially frequencies from some source like HapMap
> (http://www.hapmap.org/).

This is too complicated for the moment. We would first need to develop
a standard way to handle HapMap data and SNPs in general.

> A nice but harder problem is to do this based on a
> protein sequence since many diseases refer to amino acids.

This is a good idea, but at the moment I was thinking more about
genotypes than other characters.
I need a better way to keep track of all these suggestions;
too bad GitHub doesn't provide an integrated ticketing system.

> Perhaps my biggest 'disappointment' is the lack of ancestry control because
> I am also interested in families or some admixture in a population. This just
> generates sequences randomly assuming you are randomly selecting individuals
> from a homogeneous population.

I think simcoal can do this?

> I do understand this usage so it is not that
> important to include this here.
>
>
>
> Bruce

-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it