[BioPython] calculate F-Statistics from SNP data

Thu Oct 30 17:36:00 EDT 2008

Hi,

FYI, I am going to move this discussion to biopython-dev, as I
think it makes more sense there, especially the parts about
implementation suggestions.

On Sun, Oct 26, 2008 at 1:34 AM, Tiago Antão <tiagoantao at gmail.com> wrote:
> I just want to add an extra comment explaining why I oppose creating an
> Individual object:
>
> I have the following questions (and others) in mind, to which I don't
> know the answers. I am not looking for answers to them; I am just
> trying to illustrate the difficulty of the problem.
>
> 1. For a certain marker, do we store the genomic position of the
> marker? Some (most) statistics don't use this information. For many
> species this information is not even available. But for some
> statistics this information is mandatory...
> 2. For a microsatellite do we store the motif and number of repeats or
> the whole sequence? (see 4)
> 3. If one is interested in SNPs and one has the full sequences does
> one store the full sequences or just the SNPs? If you store just the
> SNPs then you cannot do sequence based analysis in the future (say
> Tajima D). If you store everything then you are consuming memory and
> cpu.
> 4. If one just wants to do frequency statistics (Fst), do you store
> the marker, or just assign each one an ID and store the ID? It is
> much cheaper to store an ID than a full sequence.
>
> Populations
> 1. Support for landscape genetics? I mean geo-referencing.
> 2. Support for hierarchical population structure?
> 3. Do we cache statistics results on Population objects?
>
>
> Let me take your Marker class:
> class Marker:
>  total_heterozygotes_count = 0
>  total_population_count = 0
>  total_Purines_count = 0 # this could be renamed, of course
>  total_Pyrimidines_count = 0
>
> How would this be useful for microsatellites? Why purines, and what if my
> marker is a protein? What if it is a SNP and I want to know the nucleotide?
> And what if I am studying proteins and I want to have the amino acid?
>
> Don't get me wrong, I have been down this path. Solving my particular
> problems is not very hard. Building a framework that is usable by
> everybody is a damn hard problem. And we don't really need to solve
> it (OK, it would be nice to do things to populations in general, that
> I agree). But the fundamental point is: read file, calculate statistics.
> That doesn't need population and individual objects.
>
> If we end up having too many formats, a consolidation step might be
> needed in the future (to avoid having 10 split_in_pops). That I agree with.
>
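
To make the "read file, calculate statistics" point concrete, here is a
minimal sketch of that workflow with no Population or Individual objects,
only plain lists of allele IDs. Everything in it is an assumption made for
illustration: the one-genotype-per-line input format, the function names
(read_genotypes, fst), and the choice of a simple Nei-style
(Ht - Hs) / Ht estimator for a single locus. It is not Biopython's API and
not a sample-size-corrected estimator such as Weir and Cockerham's.

```python
from collections import Counter, defaultdict


def read_genotypes(lines):
    """Parse a hypothetical format: 'pop_name allele_id allele_id' per line.

    Returns a dict mapping population name to a flat list of allele IDs.
    """
    pops = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        pops[fields[0]].extend(int(a) for a in fields[1:])
    return dict(pops)


def allele_frequencies(alleles):
    """Relative frequency of each allele ID in one population."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def expected_heterozygosity(freqs):
    """1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs.values())


def fst(populations):
    """Nei-style Fst = (Ht - Hs) / Ht for one locus.

    populations is a list of allele-ID lists, one per population.
    Populations are weighted equally (an assumption of this sketch).
    """
    pop_freqs = [allele_frequencies(p) for p in populations]
    n = len(pop_freqs)
    # Hs: mean within-population expected heterozygosity
    hs = sum(expected_heterozygosity(f) for f in pop_freqs) / n
    # Ht: expected heterozygosity of the mean allele frequencies
    alleles = set().union(*pop_freqs)
    mean_freqs = {a: sum(f.get(a, 0.0) for f in pop_freqs) / n for a in alleles}
    ht = expected_heterozygosity(mean_freqs)
    return 0.0 if ht == 0 else (ht - hs) / ht


if __name__ == "__main__":
    data = ["popA 1 1", "popA 1 2", "popB 1 2", "popB 2 2"]
    pops = read_genotypes(data)
    print(fst(list(pops.values())))
```

Storing integer allele IDs instead of sequences is enough for a frequency
statistic like this one, which is the trade-off raised in question 4 above;
a sequence-based statistic would need a different input.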

-- 
"Data always beats theories. 'Look at data three times and then come
to a conclusion,' versus 'coming to a conclusion and searching for
some data.' The former will win every time."
—Matthew Simmons,
http://www.tiago.org