[BioPython] calculate F-Statistics from SNP data

Mon Oct 20 01:41:56 EDT 2008

Hi,

On Sun, Oct 19, 2008 at 3:50 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> ok, thank you very much!!
> I would like to use git to keep track of the changes I will make to the
> code.
> What do you think if I'll upload it to http://github.com and then upload it
> back on biopython when it is finished?
> I am not sure, but I think it would be possible to convert the logs back to
> cvs to reintegrate the changes in biopython.

I think it is a good idea. When we reintegrate back I think there will
be no need to backport the commit logs anyway.

> One of the problems we are having here, is that it takes too much RAM memory
> to store all the information about characters for every population.
> I was going to write a Population object, in which I'll store only the total
> count of heterozygotes, individuals, and what is needed, instead of the
> information about characters (('a', 'a'), ('a', 'c'), ...)

I am afraid that this is not enough. Even for Fst. I suppose you are
acquainted with a formula with just heterozigosities. That is more of
just a textbook formula only.  The Fst standard estimator is really
Cockerham and Wier Theta estimator (1984 paper), and I think it needs
individual information (or at the very least allele counts). Check my
implementation of Fst, which should be it (less the bugs that are in).
Maybe my implementation of theta is wrong, which is a possiblity. But
theta is the standard.

May I do a suggestion for your problem? Split in SNP groups (like 100
at a time) and calculate 100 Fsts at time. Store the calculated Fsts
to disk and then join them at the end.

As a general rule, whatever goes into biopython has to be general
enough to accomodate all standard statistics (not just Fs). One cannot
make a solution that is taliored to solve just our personal research
issues.

I am currently traveling (which seems to be my constant state). When I
arrive back at office, on Wednsday, I will make a few suggestions on
how we can structure things. I have a few ideas that I would like to
share and discuss.

> class Marker:
>   total_heterozygotes_count = 0
>   total_population_count = 0
>   total_Purines_count = 0 # this could be renamed, of course
>   total_Pyrimidines_count = 0

Also, your representation seems to be targetted toward SNPs, people
use lots of other things (microsatellites are still used a lot). We
have to think about something that is useful to the general public.
Let me get back to you on Wednesday we ideas. If you are interested we
can work together to make a nice population genetics module that can
be used in a wide range of situations.