[BioPython] calculate F-Statistics from SNP data

Sun Oct 26 01:34:55 UTC 2008

I just want to add an extra comment explaining why I oppose creating an
individual object:

I have the following questions (and others) in mind, to which I don't
know the answers. I am not looking for answers to them; I am just
trying to illustrate the difficulty of the problem.

1. For a certain marker, do we store the genomic position of the
marker? Some (most) statistics don't use this information. For many
species this information is not even available. But for some
statistics this information is mandatory...
2. For a microsatellite do we store the motif and number of repeats or
the whole sequence? (see 4)
3. If one is interested in SNPs and one has the full sequences, does
one store the full sequences or just the SNPs? If you store just the
SNPs then you cannot do sequence-based analysis in the future (say
Tajima's D). If you store everything then you are consuming memory and
CPU.
4. If one just wants to do frequency statistics (Fst), do you store
the marker itself, or just assign each one an ID and store the ID? It is
much cheaper to store an ID than a full sequence.
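
To make point 4 concrete, here is a minimal sketch of the "store an ID"
option. The function name encode_alleles is made up for illustration and
is not part of Biopython; it only assumes that alleles are hashable
values (strings, tuples, repeat counts, ...).

```python
def encode_alleles(alleles):
    """Replace each allele by a small integer ID (hypothetical helper).

    Returns the list of IDs plus the allele -> ID table, so the original
    alleles can still be recovered if the table is kept around.
    """
    ids = {}
    encoded = []
    for allele in alleles:
        # The first time an allele is seen it gets the next free ID.
        encoded.append(ids.setdefault(allele, len(ids)))
    return encoded, ids


# Three observations of a marker, two distinct alleles.
encoded, ids = encode_alleles(["ACGTACGT", "ACGAACGT", "ACGTACGT"])
```

Frequency statistics only need the IDs; anything sequence-based would
still need the table (or the original data).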

Populations
1. Support for landscape genetics? I mean georeferencing
2. Support for hierarchical population structure?
3. Do we cache statistics results on Population objects?

Let me take your Marker class:
class Marker:
  total_heterozygotes_count = 0
  total_population_count = 0
  total_Purines_count = 0 # this could be renamed, of course
  total_Pyrimidines_count = 0

How would this be useful for microsatellites? Why purines, and what if
my marker is a protein? If it is a SNP, wouldn't I want to know the
nucleotide? And if I am studying proteins, wouldn't I want the amino acid?

Don't get me wrong, I have been down this path. Solving my particular
problems is not very hard. Building a framework that is usable by
everybody is a damn hard problem. And we don't really need to solve
it (OK, it would be nice to do things to populations in general, with
that I agree). But the fundamental point is: read file, calculate
statistics. That doesn't need population and individual objects.
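
As a rough sketch of what "read file, calculate statistics" can look
like without any population or individual classes: the input format
below (one "population allele" pair per line) and the function names are
assumptions made up for this example, and the estimator is the plain
(Ht - Hs) / Ht form with unweighted population means, not what any
existing Bio.PopGen code computes.

```python
from collections import Counter


def read_alleles(lines):
    """Parse 'population allele' lines (hypothetical format) into a dict."""
    pops = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        pop, allele = line.split()
        pops.setdefault(pop, []).append(allele)
    return pops


def allele_frequencies(alleles):
    """Relative frequency of each allele in one population."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def expected_heterozygosity(freqs):
    """1 - sum(p_i^2) for a dict of allele frequencies."""
    return 1.0 - sum(p * p for p in freqs.values())


def fst(pops):
    """Simple single-locus Fst = (Ht - Hs) / Ht, unweighted across populations."""
    per_pop = [allele_frequencies(alleles) for alleles in pops.values()]
    n = len(per_pop)
    # Hs: mean within-population expected heterozygosity.
    hs = sum(expected_heterozygosity(f) for f in per_pop) / n
    # Ht: expected heterozygosity of the mean allele frequencies.
    all_alleles = set().union(*per_pop)
    mean = {a: sum(f.get(a, 0.0) for f in per_pop) / n for a in all_alleles}
    ht = expected_heterozygosity(mean)
    return (ht - hs) / ht if ht > 0 else 0.0


# Two populations fixed for different alleles: complete differentiation.
fixed = fst(read_alleles(["pop1 A", "pop1 A", "pop2 G", "pop2 G"]))
# Two populations with identical frequencies: no differentiation.
same = fst(read_alleles(["pop1 A", "pop1 G", "pop2 A", "pop2 G"]))
```

Plain dicts and lists are enough here; nothing in the calculation needs
to know what kind of marker the allele labels come from.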

If we end up having too many formats, a consolidation step might be
needed in the future (to avoid having 10 split_in_pops). On that I agree.