[BioPython] calculate F-Statistics from SNP data

Fri Oct 17 05:39:41 EDT 2008

On Thu, Oct 16, 2008 at 12:23 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
> > Hi,
> > I was going to write a python program to calculate Fst statistics from a
> > sample of SNP data.  Is there any module already available to do that
> > in biopython, that I am missing?  I saw there is a 'PopGen' module, but
> > the Cookbook says it doesn't support sequence data.
> > Is someone actually writing any module in python to calculate such
> > statistics?
>
> I think this will be a question for Tiago (the Bio.PopGen author),
> although others on the list may have also tackled similar questions.
>
> In terms of reading in the SNP data, what file format will you be
> loading?  Does Bio.SeqIO currently suffice?
>

Hi,
thank you all very much for the replies.
Actually I am going to use tped[1] and tfam[1] files as input, formatted
for the plink program[2].
Bio.SeqIO doesn't support these formats, but that is fine, because they don't
contain only sequences but rather elements, as Tiago was saying.
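
To make the two formats concrete, here is a minimal sketch of how they could
be read into plain tuples. It is only an illustration: the function names
parse_tfam and parse_tped are hypothetical and not part of Biopython, and I
am assuming the layout described in the plink documentation[1] (six
whitespace-separated columns per individual in the tfam file; chromosome,
SNP id, genetic distance, base-pair position and then two alleles per
individual in the tped file).

```python
import io


def parse_tfam(handle):
    """Yield one 6-field tuple per individual from a tfam file.

    Assumed columns: family id, individual id, paternal id,
    maternal id, sex, phenotype (all kept as strings).
    """
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 6:
            raise ValueError("expected 6 columns, got %d" % len(fields))
        yield tuple(fields)


def parse_tped(handle):
    """Yield one tuple per SNP from a tped file.

    Each tuple is (chromosome, snp_id, genetic_distance, bp_position,
    genotypes), where genotypes is a list of (allele1, allele2) pairs,
    one pair per individual, in the same order as the tfam file.
    """
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 4 or len(fields) % 2 != 0:
            raise ValueError("malformed tped line: %r" % line)
        chrom, snp_id, gen_dist, bp_pos = fields[:4]
        alleles = fields[4:]
        # Pair up consecutive alleles: (a1, a2) for each individual.
        genotypes = list(zip(alleles[0::2], alleles[1::2]))
        yield (chrom, snp_id, float(gen_dist), int(bp_pos), genotypes)


# Tiny made-up example with two individuals and two SNPs.
tfam_text = "FAM1 IND1 0 0 1 2\nFAM1 IND2 0 0 2 1\n"
tped_text = "1 rs0001 0 1000 A A A G\n1 rs0002 0 2000 C T T T\n"

individuals = list(parse_tfam(io.StringIO(tfam_text)))
snps = list(parse_tped(io.StringIO(tped_text)))
```

Both functions are generators, so a large file can be processed one line at
a time without holding everything in memory.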

Let's say I try to write a parser for these two file formats. In which
biopython object should I save them? Is there any kind of 'Individual' or
'Population' object in biopython?
I see from the cookbook that Bio.PopGen.GenePop.Record represents
populations and individuals as lists[3], and that there is no 'Population'
or 'Individual' object.
I think that is a good approach, because these kinds of files tend to be
very big, and instantiating an Individual object instead of a tuple for every
line of the file would take a lot of memory.
But are you going to implement some kind of 'Individual' or 'Population'
object?
Moreover, Python 2.6 introduces a new kind of data object, called the 'named
tuple' [4], for implementing this kind of record. It could be a good
compromise (maybe I should start a new thread about this and explain it
better).
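
As a rough sketch of what I mean by a named tuple record: the example below
uses collections.namedtuple from the standard library. The 'Individual'
type and its field names are only my own illustration (they follow the six
tfam columns) and are not an existing Biopython class.

```python
from collections import namedtuple

# Hypothetical record type; the field names mirror the tfam columns.
Individual = namedtuple(
    "Individual",
    ["family_id", "individual_id", "paternal_id",
     "maternal_id", "sex", "phenotype"],
)

line = "FAM1 IND1 0 0 1 2"
ind = Individual(*line.split())

# Fields can be read by name or by position, and the record is
# still an ordinary tuple, so code expecting tuples keeps working.
name_by_attribute = ind.individual_id
name_by_index = ind[1]
is_plain_tuple = isinstance(ind, tuple)
```

A named tuple instance does not carry a per-instance __dict__, so it stays
close to a plain tuple in size while giving readable field names.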

[1] tped, tfam: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#tr
[2] plink: http://pngu.mgh.harvard.edu/~purcell/plink/index.shtml
[3] biopython cookbook, popgen:
http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc112
[4] named tuples in python 2.6: http://code.activestate.com/recipes/500261/

>
> Have you looked into what (if any) additional python libraries you
> would need?  For any Biopython addition, a dependency on just numpy
> would be preferable, but Tiago has previously suggested an
> optional dependency on scipy for additional statistics needed in
> population genetics.
>
> Peter
>

-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it