[BioPython] calculate F-Statistics from SNP data

Thu Oct 16 11:26:52 EDT 2008

The task view on Genetics for R provides a good starting point to find
R packages related to the field:

http://www.freestatistics.org/cran/web/views/Genetics.html

On Thu, Oct 16, 2008 at 4:16 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> On Thu, Oct 16, 2008 at 10:10 AM, Tiago Antão <tiagoantao at gmail.com> wrote:
>> Hi,
>>
>> On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio
>> <dalloliogm at gmail.com> wrote:
>>> Hi,
>>> I was going to write a python program to calculate Fst statistics from a
>>> sample of SNP data.
>>> Is there any module already available to do that in biopython, that I am
>>> missing?
>>> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support
>>> sequence data.
>>> Is someone actually writing any module in python to calculate such
>>> statistics?
>>
>> The answer to this has to be done in parts, because it is actually a
>> bunch of related (but different) issues
>>
>>
>> On the data
>> 1. Sequence support. Bio.PopGen doesn't support statistics for
>> sequences (like Tajima D and the like), BUT that is not relevant if
>> you want to do frequency based statistics (like good old Fst), you
>> just have to count frequencies and put into a "frequency format"
>> 2. SNPs is actually not a sequence, but a single element, so it
>> becomes easier. What you need at the end is something like this:
>> For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0
>> For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0
>> For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10
>> And so on... You have to end up with frequency counts per population
>> So, as long as you convert data (sequence, SNP, microsatellite) to
>> frequency counts per population, there are no issues with the type of
>> data.
>>
>> On calculating the statistics (Fst)
>> 1. I am fully aware that core statistics like Fst (I work with Fst a
>> lot myself) are fundamental in a population genetics module, but I
>> sincerely don't know how to proceed because a long term solution
>> requires generic statistical support (e.g., chi-square tests
>> Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy
>> (and I will not maintain generic statistics code myself). I know that
>> Bio.PopGen is of little use without support for standard statistics.
>> 2. A workaround (for which I have code written - but not commited to
>> the repository - I can give it to you) is to invoke GenePop and get
>> the Fst estimation. This requires the data to be in GenePop format
>> (again you can convert SNPs and even sequences to frequency based
>> format)
>> 3. That being said, I have code to estimate Fst (Cockerham and Wier
>> theta and a variation from Mark Beaumont) in Python. I can give it to
>> you (but is not much tested).
>>
>>
>> On sequence data formats:
>> 1. Note that sequence data files (that I know off) have no provision
>> for population structure (you cannot say, in a standard way, sequence
>> X belongs to population Y). You have to do it in adhoc way. That means
>> you have to invent your own convention for your private use.
>> 2. Anyway, in your case I suppose you still have to extract the SNPs
>> from the sequence.
>> 3. If you want do frequency based analysis on your SNPs, I suggest you
>> do a conversion to GenePop anyway (therefore you can import your data
>> in most population structure software as GenePop format is the defacto
>> standard)...
>> 4. Because of the above there is actually no good solution for
>> automated conversion from sequence information to frequency based one
>> (in biopython or in any platform whatsoever)
>> I can give more suggestions if you give more details or have more
>> specific questions.
>
> Just a little note that the R programming language has some packages
> for population genetics and, of course, has excellent statistical
> tools.  One can interface with it via rpy.  I'm not advocating going
> this route, but just wanted to let people know about another option.
>
> Sean
>
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

-- 
"Data always beats theories. 'Look at data three times and then come
to a conclusion,' versus 'coming to a conclusion and searching for
some data.' The former will win every time."
—Matthew Simmons,
http://www.tiago.org