[BioPython] calculate F-Statistics from SNP data

Giovanni Marco Dall'Olio dalloliogm at gmail.com
Mon Oct 20 13:57:27 UTC 2008


On Mon, Oct 20, 2008 at 7:41 AM, Tiago Antão <tiagoantao at gmail.com> wrote:

> Hi,
>
> On Sun, Oct 19, 2008 at 3:50 PM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
> > ok, thank you very much!!
> > I would like to use git to keep track of the changes I will make to the
> > code.
> > What do you think if I'll upload it to http://github.com and then upload
> it
> > back on biopython when it is finished?
> > I am not sure, but I think it would be possible to convert the logs back
> to
> > cvs to reintegrate the changes in biopython.
>
> I think it is a good idea. When we reintegrate back I think there will
> be no need to backport the commit logs anyway.


Ok, I have uploaded the code to:
- http://github.com/dalloliogm/biopython---popgen

I put the code I wrote before writing in this mailing list in the folder
PopGen/Gio
-
http://github.com/dalloliogm/biopython---popgen/tree/6f6fa66cda1908dc8334ab6e9e69b7c85290a8be/src/PopGen/Gio
However, I plan to integrate these scripts with your code or re-write the
completely (well, your code is a lot better than mine :) ).

Just a curiosity: why do you use the '<>' operator instead of '!='?
Is it better supported in python 3.0?


> > One of the problems we are having here, is that it takes too much RAM
> memory
> > to store all the information about characters for every population.
> > I was going to write a Population object, in which I'll store only the
> total
> > count of heterozygotes, individuals, and what is needed, instead of the
> > information about characters (('a', 'a'), ('a', 'c'), ...)
>
> I am afraid that this is not enough. Even for Fst. I suppose you are
> acquainted with a formula with just heterozigosities.


Yes, I was trying to implement a very basic formula at first.


> That is more of
> just a textbook formula only.  The Fst standard estimator is really
> Cockerham and Wier Theta estimator (1984 paper)


Bioperl's Bio::PopGen::PopStats uses the same formula:
-
http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/PopGen/PopStats.html#POD3

"""
Bioperl's Bio:Based on diploid method in Weir BS, Genetics Data
Analysis II, 1996
           page 178.
"""

, and I think it needs
> individual information (or at the very least allele counts). Check my
> implementation of Fst, which should be it (less the bugs that are in).
> Maybe my implementation of theta is wrong, which is a possiblity. But
> theta is the standard.
>
> May I do a suggestion for your problem? Split in SNP groups (like 100
> at a time) and calculate 100 Fsts at time. Store the calculated Fsts
> to disk and then join them at the end.
>

Thanks - that's a good suggestion


>
>
> I am currently traveling (which seems to be my constant state). When I
> arrive back at office, on Wednsday, I will make a few suggestions on
> how we can structure things. I have a few ideas that I would like to
> share and discuss.
>

Have a nice trip!


>
> > class Marker:
> >   total_heterozygotes_count = 0
> >   total_population_count = 0
> >   total_Purines_count = 0 # this could be renamed, of course
> >   total_Pyrimidines_count = 0
>
>
> Also, your representation seems to be targetted toward SNPs, people
> use lots of other things (microsatellites are still used a lot). We
> have to think about something that is useful to the general public.
> Let me get back to you on Wednesday we ideas. If you are interested we
> can work together to make a nice population genetics module that can
> be used in a wide range of situations.
>

Yes, I agree. It was just a first try. We should collect some good
use-cases.



-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it




More information about the Biopython mailing list