[BioPython] calculate F-Statistics from SNP data

Thu Oct 16 14:14:28 UTC 2008

Just a minor point: I am so used to work in Fst that I mentally
converted your "F-statistics" to Fst. Most of my mail still stands.
The only point that changes a bit is that I only have code for Fst, so
I cannot help you with any other.

On Thu, Oct 16, 2008 at 3:10 PM, Tiago Antão <tiagoantao at gmail.com> wrote:
> Hi,
>
> On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
>> Hi,
>> I was going to write a python program to calculate Fst statistics from a
>> sample of SNP data.
>> Is there any module already available to do that in biopython, that I am
>> missing?
>> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support
>> sequence data.
>> Is someone actually writing any module in python to calculate such
>> statistics?
>
> The answer to this has to be done in parts, because it is actually a
> bunch of related (but different) issues
>
>
> On the data
> 1. Sequence support. Bio.PopGen doesn't support statistics for
> sequences (like Tajima D and the like), BUT that is not relevant if
> you want to do frequency based statistics (like good old Fst), you
> just have to count frequencies and put into a "frequency format"
> 2. SNPs is actually not a sequence, but a single element, so it
> becomes easier. What you need at the end is something like this:
> For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0
> For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0
> For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10
> And so on... You have to end up with frequency counts per population
> So, as long as you convert data (sequence, SNP, microsatellite) to
> frequency counts per population, there are no issues with the type of
> data.
>
> On calculating the statistics (Fst)
> 1. I am fully aware that core statistics like Fst (I work with Fst a
> lot myself) are fundamental in a population genetics module, but I
> sincerely don't know how to proceed because a long term solution
> requires generic statistical support (e.g., chi-square tests
> Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy
> (and I will not maintain generic statistics code myself). I know that
> Bio.PopGen is of little use without support for standard statistics.
> 2. A workaround (for which I have code written - but not commited to
> the repository - I can give it to you) is to invoke GenePop and get
> the Fst estimation. This requires the data to be in GenePop format
> (again you can convert SNPs and even sequences to frequency based
> format)
> 3. That being said, I have code to estimate Fst (Cockerham and Wier
> theta and a variation from Mark Beaumont) in Python. I can give it to
> you (but is not much tested).
>
>
> On sequence data formats:
> 1. Note that sequence data files (that I know off) have no provision
> for population structure (you cannot say, in a standard way, sequence
> X belongs to population Y). You have to do it in adhoc way. That means
> you have to invent your own convention for your private use.
> 2. Anyway, in your case I suppose you still have to extract the SNPs
> from the sequence.
> 3. If you want do frequency based analysis on your SNPs, I suggest you
> do a conversion to GenePop anyway (therefore you can import your data
> in most population structure software as GenePop format is the defacto
> standard)...
> 4. Because of the above there is actually no good solution for
> automated conversion from sequence information to frequency based one
> (in biopython or in any platform whatsoever)
> I can give more suggestions if you give more details or have more
> specific questions.
>

-- 
"Data always beats theories. 'Look at data three times and then come
to a conclusion,' versus 'coming to a conclusion and searching for
some data.' The former will win every time."
—Matthew Simmons,
http://www.tiago.org