[BioPython] calculate F-Statistics from SNP data

Tiago Antão tiagoantao at gmail.com
Wed Oct 22 15:52:19 UTC 2008


Hi,

[Back in office now]

> Ok, I have uploaded the code to:
> - http://github.com/dalloliogm/biopython---popgen
>
> I put the code I wrote before writing in this mailing list in the folder
> PopGen/Gio

Thanks, I will have a look and get acquainted with Git.


>> I am afraid that this is not enough. Even for Fst. I suppose you are
>> acquainted with a formula with just heterozygosities.
>
> Yes, I was trying to implement a very basic formula at first.

For publication and data analysis the standard is Weir and Cockerham's
theta. The basic (Ht-Hs)/Ht formula (or a variation of it) might be
misleading as to the amount of information that is needed.
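
To make the contrast concrete, here is a minimal sketch of the basic
heterozygosity-based estimate, Fst = (Ht - Hs)/Ht, for a single biallelic
locus. The function names are hypothetical (they are not part of
Bio.PopGen), subpopulations are weighted equally, and no sample-size
correction is applied, which is part of what Weir and Cockerham's theta
adds on top of this.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def basic_fst(subpop_freqs):
    """Basic Fst = (Ht - Hs) / Ht from per-subpopulation allele frequencies.

    Assumptions (illustrative only): one biallelic locus, equal weight for
    every subpopulation, no correction for sample size.
    """
    if not subpop_freqs:
        raise ValueError("need at least one subpopulation frequency")
    n = len(subpop_freqs)
    # Hs: mean expected heterozygosity within subpopulations
    hs = sum(expected_heterozygosity(p) for p in subpop_freqs) / n
    # Ht: expected heterozygosity of the pooled (mean) allele frequency
    ht = expected_heterozygosity(sum(subpop_freqs) / n)
    if ht == 0.0:
        raise ValueError("locus is monomorphic overall; Fst is undefined")
    return (ht - hs) / ht
```

Note that this only needs allele frequencies, whereas theta also needs
the sample sizes and observed heterozygote counts per subpopulation.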


> Yes, I agree. It was just a first try. We should collect some good
> use-cases.


In my head I divide statistics along the following dimensions:
1. genetic versus genomic (e.g. Fst is single locus; LD can be seen as
requiring more than 1 locus, and is therefore "genomic")
2. frequency based versus marker based (some statistics require
frequencies only, i.e., you can calculate them irrespective of the type
of marker. This is the case of Fst. Others are marker dependent: say,
Tajima's D requires sequences and can only be used with sequences)
3. population structure versus no pop structure. Some stats require
population structure (again, Fst), others don't (e.g., allelic
richness)

From my point of view, a long-term solution needs to take into account
these dimensions (and others that I might be forgetting).
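
As a rough illustration of what "taking the dimensions into account"
could look like, the sketch below tags each statistic with the three
dimensions listed above. Everything here (StatisticInfo, CATALOGUE,
requiring_structure) is a hypothetical name made up for this example,
not an existing Biopython API. A value of None means that dimension was
not stated for that statistic in this message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StatisticInfo:
    """Hypothetical metadata record; None means 'not stated here'."""
    name: str
    multi_locus: Optional[bool]        # dimension 1: genomic (True) vs genetic (False)
    marker_dependent: Optional[bool]   # dimension 2: marker based (True) vs frequency based (False)
    needs_structure: Optional[bool]    # dimension 3: requires population structure


# Only the properties mentioned in the list above are filled in.
CATALOGUE = [
    StatisticInfo("Fst", multi_locus=False, marker_dependent=False, needs_structure=True),
    StatisticInfo("LD", multi_locus=True, marker_dependent=None, needs_structure=None),
    StatisticInfo("Tajima's D", multi_locus=None, marker_dependent=True, needs_structure=None),
    StatisticInfo("allelic richness", multi_locus=None, marker_dependent=None, needs_structure=False),
]


def requiring_structure(catalogue):
    """Return the statistics explicitly known to need population structure."""
    return [s for s in catalogue if s.needs_structure is True]
```

With metadata like this, code could check up front whether a given
dataset (frequencies only, no sequences, no population labels) can
support a given statistic, instead of failing halfway through.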

One can think of a solution based on Populations and Individuals as
fundamental objects (as opposed to statistics), but, in my
experience, it is very difficult to define what an "individual" is
(i.e., what kind of information you need to store; I can expand on
this). It is easier to think in terms of statistics.

One fundamental point is that we don't have many opportunities to get
it right: if we define an architecture which later proves to be
insufficient, then we will have to both maintain the old legacy
(because there will be users around whose code cannot be constantly
broken when a new version is made available) and hack the new
features in.


