[BioPython] calculate F-Statistics from SNP data

Sun Oct 19 10:50:54 EDT 2008

On Sat, Oct 18, 2008 at 6:50 PM, Tiago Antão <tiagoantao at gmail.com> wrote:

>  > here have used bioperl Bio::PopGen::PopStat, but we saw that using that
> > module as it is now in bioperl is too much computationally-expensive for
> our
> > resources.
> > So, we are going to either refactor the bioperl function, or to write
> custom
> > scripts in python to calculate Fst.
> > I can program perl, but I would prefer to use python in use, since I like
> > object oriented programming.
>
> You can find my (completely unofficial, completely untested) PopGen module
> here:
> http://popgen.eu/PopGen.tar.gz
> You should take a biopython distro and replace the PopGen directory
> with the contents of this one.
>

ok, thank you very much!!
I would like to use git to keep track of the changes I will make to the
code.
What do you think if I'll upload it to http://github.com and then upload it
back on biopython when it is finished?
I am not sure, but I think it would be possible to convert the logs back to
cvs to reintegrate the changes in biopython.

> There are 2 ways to calculate Fst:
> Doing something this:
> from Bio.PopGen.Stats.Structural import Fst
>
> fst = Fst()

> fst.add_pop('Pop 1', [('a', 'a'), ('a', 'c'), ('a','c')])
> fst.add_pop('Pop 2', [('a', 'c'), ('a', 'c'), ('a','c')])
>

One of the problems we are having here, is that it takes too much RAM memory
to store all the information about characters for every population.
I was going to write a Population object, in which I'll store only the total
count of heterozygotes, individuals, and what is needed, instead of the
information about characters (('a', 'a'), ('a', 'c'), ...)

It is something like this:
class Population:
  markers = []

class Marker:
  total_heterozygotes_count = 0
  total_population_count = 0
  total_Purines_count = 0 # this could be renamed, of course
  total_Pyrimidines_count = 0

>
> Or using the new GenePop code (see GenePop/Controller.py), by using
> genepop to calculate Fsts.
>
> A few comments:
> 1. I don't trust my own Fst code (not tested at all, I am actually
> using GenePop as above). You can find it on PopGen.Stats.Structural
> (Fst, and also FstBeaumont). There is code there for Fst, Fis and Fit.
> Also Fk (I trust the Fk code, but its the only one)

I will ask my group leader to help me in writing down some good test data.
I'll let you know when I will speak with him.

>
> 2. If your problem is performance, I think you have to go to a faster
> language. Scripting languages strongly underperfom on the speed issue.
> I find this problem lots of times. C, C++ and Java (yes, java for
> performance) is what I use. Perl, Python and other scripting languages
> are quite bad performance-wise.

I know.. but I think this time, the problem is in memory usage.

> 3. You can find a Fst implementation in C++ on simuPop (see file
> stator.cpp). GenePop code must also have Fst implemented.

4. I have a Fst based application using Biopython PopGen with Fst (but
> for another application) - Fdist, you can find it at:
> http://www.biomedcentral.com/1471-2105/9/323 . Module Bio.PopGen.FDist
> (incidentally, you can also use this to calculate Fst ;) ).
> 5. My code on Bio.PopGen.Stats is surely not in its final form. I have
> a plan to change it massively. If you are interested in participating
> in the discussion, you are welcome.
>
> > This is to say that if you want, we can work on the same code, and
> > contribute it to biopython.
>
> This would be most welcome. I have almost no sense of propriety for
> the code that is on Bio.PopGen. So, if you work on this, go ahead!
>
>
> > I am writing a ped file parser (everybody here is used to this format,
> and I
> > don't know GenePop :( ), and a simple script that calculates Fst with the
> > most basic formula.
> > I am also trying to design some good tests, and I am using subversion as
> as
> > source control system.
> > Maybe I can also send this to you, so you can have a look (but it is
> still
> > very basic, I started yesterday).
>
> Again, any contribution would be most welcome. Regarding parsers I
> would suggest you to have a look at how parsers are done in Biopython.
> I am following the "standard". You can find an example on
> Bio.PopGen.GenePop.__init__.py. From my point of view I have nothing
> against a "non standard" parser as long as it is documented and
> commented.

Thank you very much.. I know more or less how parsers are written in
biopython, but I have never written one myself.

>
>
> Again, feel free to take this discussion to biopython-dev, especially
> if you are willing to contribute.
>

-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it