[BioPython] calculate F-Statistics from SNP data

Thu Oct 16 11:16:51 EDT 2008

On Thu, Oct 16, 2008 at 10:10 AM, Tiago Antão <tiagoantao at gmail.com> wrote:
> Hi,
>
> On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
>> Hi,
>> I was going to write a python program to calculate Fst statistics from a
>> sample of SNP data.
>> Is there any module already available to do that in biopython, that I am
>> missing?
>> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support
>> sequence data.
>> Is someone actually writing any module in python to calculate such
>> statistics?
>
> The answer to this has to be done in parts, because it is actually a
> bunch of related (but different) issues
>
>
> On the data
> 1. Sequence support. Bio.PopGen doesn't support statistics for
> sequences (like Tajima D and the like), BUT that is not relevant if
> you want to do frequency based statistics (like good old Fst), you
> just have to count frequencies and put into a "frequency format"
> 2. SNPs is actually not a sequence, but a single element, so it
> becomes easier. What you need at the end is something like this:
> For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0
> For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0
> For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10
> And so on... You have to end up with frequency counts per population
> So, as long as you convert data (sequence, SNP, microsatellite) to
> frequency counts per population, there are no issues with the type of
> data.
>
> On calculating the statistics (Fst)
> 1. I am fully aware that core statistics like Fst (I work with Fst a
> lot myself) are fundamental in a population genetics module, but I
> sincerely don't know how to proceed because a long term solution
> requires generic statistical support (e.g., chi-square tests
> Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy
> (and I will not maintain generic statistics code myself). I know that
> Bio.PopGen is of little use without support for standard statistics.
> 2. A workaround (for which I have code written - but not commited to
> the repository - I can give it to you) is to invoke GenePop and get
> the Fst estimation. This requires the data to be in GenePop format
> (again you can convert SNPs and even sequences to frequency based
> format)
> 3. That being said, I have code to estimate Fst (Cockerham and Wier
> theta and a variation from Mark Beaumont) in Python. I can give it to
> you (but is not much tested).
>
>
> On sequence data formats:
> 1. Note that sequence data files (that I know off) have no provision
> for population structure (you cannot say, in a standard way, sequence
> X belongs to population Y). You have to do it in adhoc way. That means
> you have to invent your own convention for your private use.
> 2. Anyway, in your case I suppose you still have to extract the SNPs
> from the sequence.
> 3. If you want do frequency based analysis on your SNPs, I suggest you
> do a conversion to GenePop anyway (therefore you can import your data
> in most population structure software as GenePop format is the defacto
> standard)...
> 4. Because of the above there is actually no good solution for
> automated conversion from sequence information to frequency based one
> (in biopython or in any platform whatsoever)
> I can give more suggestions if you give more details or have more
> specific questions.

Just a little note that the R programming language has some packages
for population genetics and, of course, has excellent statistical
tools.  One can interface with it via rpy.  I'm not advocating going
this route, but just wanted to let people know about another option.

Sean