[BioPython] calculate F-Statistics from SNP data

Thu Oct 16 14:10:47 UTC 2008

Hi,

On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Hi,
> I was going to write a python program to calculate Fst statistics from a
> sample of SNP data.
> Is there any module already available to do that in biopython, that I am
> missing?
> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support
> sequence data.
> Is someone actually writing any module in python to calculate such
> statistics?

The answer to this has to be done in parts, because it is actually a
bunch of related (but different) issues

On the data
1. Sequence support. Bio.PopGen doesn't support statistics for
sequences (like Tajima D and the like), BUT that is not relevant if
you want to do frequency based statistics (like good old Fst), you
just have to count frequencies and put into a "frequency format"
2. SNPs is actually not a sequence, but a single element, so it
becomes easier. What you need at the end is something like this:
For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0
For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0
For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10
And so on... You have to end up with frequency counts per population
So, as long as you convert data (sequence, SNP, microsatellite) to
frequency counts per population, there are no issues with the type of
data.

On calculating the statistics (Fst)
1. I am fully aware that core statistics like Fst (I work with Fst a
lot myself) are fundamental in a population genetics module, but I
sincerely don't know how to proceed because a long term solution
requires generic statistical support (e.g., chi-square tests
Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy
(and I will not maintain generic statistics code myself). I know that
Bio.PopGen is of little use without support for standard statistics.
2. A workaround (for which I have code written - but not commited to
the repository - I can give it to you) is to invoke GenePop and get
the Fst estimation. This requires the data to be in GenePop format
(again you can convert SNPs and even sequences to frequency based
format)
3. That being said, I have code to estimate Fst (Cockerham and Wier
theta and a variation from Mark Beaumont) in Python. I can give it to
you (but is not much tested).

On sequence data formats:
1. Note that sequence data files (that I know off) have no provision
for population structure (you cannot say, in a standard way, sequence
X belongs to population Y). You have to do it in adhoc way. That means
you have to invent your own convention for your private use.
2. Anyway, in your case I suppose you still have to extract the SNPs
from the sequence.
3. If you want do frequency based analysis on your SNPs, I suggest you
do a conversion to GenePop anyway (therefore you can import your data
in most population structure software as GenePop format is the defacto
standard)...
4. Because of the above there is actually no good solution for
automated conversion from sequence information to frequency based one
(in biopython or in any platform whatsoever)
I can give more suggestions if you give more details or have more
specific questions.