[BioPython] calculate F-Statistics from SNP data

Tiago Antão tiagoantao at gmail.com
Sat Oct 25 21:04:01 EDT 2008


[Sorry for the delay in answering]

On Thu, Oct 23, 2008 at 5:25 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:

> However I think it would be good to add to biopython at least some
> funcionality to calculate Fst statistics and parse these file formats, at
> least at the level at which BioPerl does.

Agree. Statistics is fundamental. I decided to postpone stats when I
started because I didn't want to to start with the core issue in
population genetics (being unexperienced at the start would probably
cause serious design errors). But I think now is the time.

> What if we just translate the same functionalities and copy the population
> objects from bioperl into biopython?

I don' t think the population objects in bioperl scale well. It is not
clear to me that their popgen module is a priority for them, and that
they carefully designed them (altough that might have changed in the
near past).
I also don' t believe that my own code (which I supplied you) is in
perfect shape to achieve this also.
I have to write down my ideas and send them here as soon as possible.
I will try to do it in the next couple of days at most.
The core idea is that there is no good abstract population and
individual objects, but they are also not needed.
What is needed, in my view, are file parsers and statistics.
Statistics should be organized in a systematic way. Example: all
frequency based, population-structure statistics should present the
same interface, something like:
   add_population(pop_name, individual_allele_list)
I will submit a small document for discussion very soon.

> I realize that it won't be the perfect solution: in fact, it is the same
> reason why I started this discussion here, the bioperl code wasn't optimized
> enought for what I want to do, but I didn't know how to modify perl modules
> and preferred python.

The important thing to notice is that biopython should not be
optimized to your needs or mine, it has to be general enough to
accomodate the vast majority of potential users. What I' ve always
tried, was to do things in a way that could be reused by others.


> Maybe we can just write a PED and GenePop parser and have let it work with
> GenePop and your modules to calculate Fst.

My suggestion would be for you to go ahead and do a Bio.PopGen.PED .
You could do it in the best way you see fit.
Converting from PED to genepop will make you loose information, if I
understand well (as you have SNP info on PED files, which you don' t
on genepop).
The other formats that I support (Fdist on released code and FStat on
the code that you have) are very similar (or less informative) than
genepop.
Again, my suggestion is for an independent parser. Of which you would
have absolute control as you would be the implementor.

I understand that this might lead to some duplicated code (like
split_in_pops), but repeated code is less of a problem than a generic
object that ends up being wrong in the long run.

> We should agree with a population object that could be used as input for
> GenePop.

For the reasons above I will fight a general Population object. At
least for now. I don't feel confident that we have the experience to
design one.
It is important to notice that we cannot break backward compatibility
without a very good reason. I think that a generic population object
will be severely resived in the future.
In your specific case I also think you would suffer with a population
object, as you need performance (parsing file, creating object,
extrating information from object, calculating statistic).
As I see it, it would be a smaller chain (parse, convert to statistic
family format, calculate statistic).

> I think it would be good anyway to release even incomplete code to the
> public, because it could be useful for other people.

Incomplete is OK. But I think we would be releasing wrong code. Code
that it would be redone in the future (and break interfaces with the
past versions). Also, a generic object would have performance problems
(it would have to be able to store all the information).

Well, I am ranting and not proposing a decent alternative. I will try
to write down something decent. I will try to write up a proposal
until Tuesday. I'm afraid the error is on my part: I have to write
down what is in my head so that people can discuss if it is a good
idea or not.


More information about the BioPython mailing list