[Biopython-dev] Statistics in population genetics module - Part I

Thu Oct 30 23:58:57 UTC 2008

Hi,

Statistics is the most important part of any population genetics
module. In fact, one could say that statistics were invented FOR
population genetics (see http://en.wikipedia.org/wiki/Ronald_Fisher ).
When I started to work on the population genetics module I decided to
delay the statistics module a bit, in order to get experience with the
whole biopython project before committing to do the most important
thing.
Irrespective of whether or not it is possible to link scipy, now seems
to be the time to advance, especially considering that Giovanni is
interested in participating.
A few points need to be made before suggesting how to put
statistics in Bio.PopGen:

1. Whatever design is put in, it should be reasonably future proof:
breaking older code a few releases from now would not be a good idea.
That should be avoided as much as possible.
2. It goes without saying that the code should be useful to everybody
doing population genetics and not only the authors of Bio.PopGen: it
should be possible to accommodate all kinds of markers and population
structures in the future.
3. For reasons that I've partially explained on the biopython list, I
don't think an OO model explicitly based on individuals or populations
is good (or even necessary).
4. Any framework should be more pragmatic than anything else. I would
envision a typical use case like this:
     a) read data (from a certain data source)
     b) do some basic processing (changing individuals or populations,
converting markers)
     c) calculate statistics
     A few comments regarding each of these points:
     a) data sources, file formats: file formats in population
genetics exist in large quantities and are essentially completely
ad hoc, most designed in a very naive way. Good or bad, that is what
there is. The most used format (GenePop, a kind of de facto standard)
can only be used for frequency-based statistics; for everything else
things are fragmented (although, if there is no population structure
and the data is sequences, then standard sequence-based formats can be
used - but in my experience this is a small minority of cases).
     b) basic processing: this is the point where an OO model of
individuals and populations would pay off, but I think it is not the
"meat of the issue".
     c) statistics: there are statistics of every type and for every
taste. If you want to have an idea of what is out there, an
interesting place to look is the arlequin3 manual:
http://cmpg.unibe.ch/software/arlequin3/arlequin31.pdf
(part of the manual is UI description, but starting at page 89 in
particular there are descriptions of the overall panorama - the table
there is a good overview).
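
To make the a) / b) / c) use case a bit more concrete, here is a
minimal sketch in plain Python. It is NOT Bio.PopGen code and not a
proposed API: the function names (read_data, drop_populations,
expected_heterozygosity), the in-memory data layout and the allele
codes are all hypothetical, chosen only to illustrate the three steps
with one simple frequency-based statistic.

```python
from collections import Counter

# Hypothetical data: population name -> list of diploid genotypes at a
# single locus. Allele codes are made up for illustration only.
RAW_DATA = {
    "pop1": [(1, 1), (1, 2), (2, 2), (1, 2)],
    "pop2": [(1, 1), (1, 1), (1, 2), (1, 1)],
    "pop3": [(2, 2), (2, 2), (2, 2), (2, 2)],
}


def read_data(source):
    """Step a) stand-in for a real file parser: copies a dict."""
    return {pop: list(genotypes) for pop, genotypes in source.items()}


def drop_populations(data, names):
    """Step b) basic processing: remove populations by name."""
    return {pop: g for pop, g in data.items() if pop not in names}


def allele_frequencies(genotypes):
    """Relative frequency of each allele in a list of genotypes."""
    counts = Counter(allele for geno in genotypes for allele in geno)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def expected_heterozygosity(genotypes):
    """Step c) 1 - sum(p_i^2), with no sample size correction."""
    freqs = allele_frequencies(genotypes)
    return 1.0 - sum(p * p for p in freqs.values())


data = read_data(RAW_DATA)                      # a) read
data = drop_populations(data, {"pop3"})         # b) process
stats = {pop: expected_heterozygosity(g)        # c) statistics
         for pop, g in data.items()}
```

Note that the statistic only needs a list of genotypes per population;
it carries its own implicit idea of what a "population" is, with no
explicit individual or population classes involved.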

With time, and after at least 3 failed attempts to think in terms of
individuals/populations, I started to crystallize around a model
centered on types of statistics. This model actually ends up having
models of populations and individuals; they are, in fact, there. They
are just implicit and not unified: different kinds of statistics have
different implicit models.
The model that I would like to propose, centered around statistics,
will be the subject of my next email (which I will send in the next
couple of days - still under design and lost sleep). I might split it
into 2 parts (concepts and suggestions for implementation).