[BioPython] [Biopython-dev] Statistics in population genetics module - Part I

Mon Nov 3 15:37:06 UTC 2008

On Fri, Oct 31, 2008 at 12:58 AM, Tiago Antão <tiagoantao at gmail.com> wrote:

> Hi,
>
> Statistics is the most important part of population genetics modules.
> In fact one could say that statistics were invented FOR population
> genetics (check http://en.wikipedia.org/wiki/Ronald_Fisher ).
> When I started to work on the population genetics module I decided to
> delay the statistics module a bit, in order to get experience with the
> whole biopython project before committing to do the most important
> thing.
> Irrespective of whether it is possible to link scipy, now seems
> to be the time to advance, especially considering that Giovanni is
> interested in participating.
> A few points need to be made before suggesting how to put
> statistics in Bio.PopGen:
>
> 1. Whatever design is put in, it should be reasonably future proof:
> breaking older code a few releases from now would not be a good idea.
> That should be avoided as much as possible.

For how long do you think a biopython module should be kept compatible
with older versions, more or less?
It will take a long time to develop the module, and we are sure to make
some mistakes. So, what is the best way to proceed? What if we create a
separate biopython branch where we can test all the new features?
At the moment I am working with a separate git repository for all the
popgen modules. The problem is that I didn't include all biopython modules
in the repository, so, if any of my changes breaks something in biopython, I
won't know it until I merge everything with the biopython code.
On the other hand, if I include a biopython release in my popgen repository,
I won't be able to track changes made in biopython, and my popgen code will
be compatible with that version only.
I think git provides some options to handle this kind of situation... I am
not very used to cvs, so I don't know.

p.s. When Python 3000 is released, it will probably be necessary to
rewrite large portions of biopython, if not to create a 'biopython 2' version
(I think they were discussing something like this on bioperl's list).
I thought that maybe, even if we make some 'mistakes' in this version of
biopython, we will be able to fix them in a later version.

>
> 2. It goes without saying that the code should be useful to everybody
> doing population genetics and not only the authors of Bio.PopGen: all
> kinds of markers and population structures should be possible to
> accommodate in the future.

I think that a good idea would be to start collecting use cases, to get an
idea of how many things we'll have to implement in this module.
It would be useful to talk to the authors of similar modules in other Bio.*
projects, to see if they have some good suggestions.
I sent that mail to the Open::Bio::I last week, but still haven't received
many replies... I will send a message to the various Bio.* mailing lists in
the next few days.

>
> 3. For reasons that I've partially explained on the biopython list, I
> don't think an OO model explicitly based on individuals or populations
> is good (or even necessary).
> 4. Any framework should be more pragmatic than anything else. I would
> envision a typical use case like this
>     a) read data (from a certain data source)
>     b) Do some basic processing (changing individuals or populations,
> converting markers)
>     c) calculate statistics
>     A few comments regarding each of these points:
>     a) data sources, file formats: file formats in population
> genetics exist in large quantities and are essentially completely
> ad-hoc, most made in a very naive way. Good or BAD, that is what there
> is. The most used format (some kind of de facto standard, GenePop) can
> only be used for frequency-based statistics; for all the rest, things
> are fragmented (although, if there is no population structure and the
> data is sequences, then standard sequence-based formats can be used -
> but from my experience this is a small minority).
>     b) basic processing: This is the point where an OO model of
> individuals and populations would pay off, but I think it is not the
> "meat of the issue".
>     c) statistics: they exist in every type and for every taste. If
> you want to have an idea of what is out there an interesting place to
> look at is the arlequin3 manual:
> http://cmpg.unibe.ch/software/arlequin3/arlequin31.pdf
> (part of the manual is UI description, but especially starting at page
> 89 - the table there is a good overview - there are descriptions of
> the overall panorama).

What if we create some very-generic objects, like:
Population
 self._to_popgen_input -> represents population as an input to popgen
([Pop1, (Ind1, Ind2...)])
 self._to_othertool_input -> represents population as an input to another tool
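
To make the outline above a bit more concrete, here is a minimal sketch in
Python. Everything in it is an assumption made for illustration: the
constructor arguments, the way individuals are stored, and the tab-separated
text used for the "other tool" are hypothetical and are not existing
Bio.PopGen API.

```python
class Population:
    """Hypothetical generic container: a population name plus its individuals."""

    def __init__(self, name, individuals):
        self.name = name
        self.individuals = list(individuals)

    def _to_popgen_input(self):
        # Follows the [Pop1, (Ind1, Ind2, ...)] layout from the outline.
        return [self.name, tuple(self.individuals)]

    def _to_othertool_input(self):
        # Placeholder format for some other tool (assumed, not a real format):
        # one "population<TAB>individual" line per individual.
        return "\n".join(
            "%s\t%s" % (self.name, ind) for ind in self.individuals
        )


pop = Population("Pop1", ["Ind1", "Ind2"])
print(pop._to_popgen_input())
print(pop._to_othertool_input())
```

The idea would be that the object itself stays very generic, and each
converter method knows only about the input expected by one tool.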

Thanks for the link to the arlequin3 manual, it seems very informative.

>
> With time, and after at least 3 failed attempts to think in terms of
> individuals/populations I started to crystallize around a model
> centered on types of statistics. This model actually ends up having
> implicit models of populations and individuals; they are, in fact,
> there. They are just implicit and not unified: different kinds of
> statistics have different implicit models.
> The model that I would like to propose, centered around statistics,
> will be the subject of my next email (which I will send in the next
> couple of days - still under design and lost sleep). I might split it
> in 2 parts (concepts and suggestions for implementation).

-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it