[BioPython] calculate F-Statistics from SNP data
Tiago Antão
tiagoantao at gmail.com
Thu Oct 23 14:51:22 UTC 2008
Hi,
> Moreover, there are some methods (like GenePop.Record.split_in_pops) that
> create Record objects, and I thought it would have been easier to always
> refer to the same one.
> Maybe we should write a generic PopGenRecord in which to store all general
> information about population genetics data.
There are two problems with that:
a) It is very difficult to come up with a representation that is general
enough (and usable in the long run).
b) A general representation would be a hassle in specific cases.
Let me elaborate:
Different kinds of genetic information have completely different
storage needs: If you are doing genomic studies you will probably want
to have location information (like this SNP is on chromosome X,
position Y). Others (probably the majority) only require frequency
information (or to know what the marker is, irrespective of position).
In most species you don't even know the genomic position of a certain
marker. So you would need a general representation capable of
handling both markers with position information and markers without
it. Then, in some cases, you need the whole marker (like if you want
to compute Tajima's D) and in others just frequency information (for
Fst). For some markers (microsatellites) you can, in most but not all
cases, ignore the genetic pattern and just count the repeats.
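
To make the point about storage needs concrete, here is a minimal sketch
(not Biopython code; the MarkerFreqs class and simple_fst function are
hypothetical names made up for illustration). It uses the plain textbook
definition Fst = (Ht - Hs) / Ht for one biallelic marker, with equal
population weights and no sample-size correction, which is an assumption
and not necessarily the estimator any particular program uses. The
computation only ever touches frequencies, so the optional position
fields are dead weight for it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class MarkerFreqs:
    """Hypothetical record: one biallelic marker, one frequency per population."""
    name: str
    freqs: Sequence[float]            # frequency of one allele in each population
    chromosome: Optional[str] = None  # usually only known in genomic studies
    position: Optional[int] = None


def simple_fst(marker: MarkerFreqs) -> float:
    """Textbook Fst = (Ht - Hs) / Ht, equal weights, no sample-size correction."""
    n_pops = len(marker.freqs)
    p_bar = sum(marker.freqs) / n_pops
    h_t = 2 * p_bar * (1 - p_bar)                               # total heterozygosity
    h_s = sum(2 * p * (1 - p) for p in marker.freqs) / n_pops   # mean within-pop
    if h_t == 0:
        raise ValueError("marker is monomorphic; Fst is undefined")
    return (h_t - h_s) / h_t


# Same frequencies, with and without genomic position.
unplaced = MarkerFreqs("snp1", [0.2, 0.8])
placed = MarkerFreqs("snp2", [0.2, 0.8], chromosome="X", position=12345)
```

Both records give the same Fst, since position never enters the
calculation. A statistic that needs the whole marker (such as Tajima's D)
could not be computed from either of them.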
You could argue that one could aim for the most general representation
possible, but that entails three problems:
1. It is very difficult to come up with a clever, correct and
future-proof representation. I, at least, have been thinking about this
issue since 2005 and have found no clever answer.
2. Performance: If you care about performance, a fully general data
representation carries a big cost (converting from the general format
to the format needed to do the computations).
3. Different formats and statistics have different requirements: For
instance, in GenePop you have neither population names nor the marker
itself, but in the Arlequin format you have partial information on
markers and full information on population names. Converting the minor
differences among formats to a "general" format would be complex.
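
As a rough illustration of point 3, the sketch below maps two toy inputs
onto one "general" record. Everything in it is hypothetical: GeneralPop,
from_genepop_like, from_arlequin_like and label are invented names, and
the input shapes are simplified stand-ins, not the real file formats or
any Biopython parser. The only thing taken from the discussion above is
which pieces of information each format lacks.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Genotype = Tuple[int, int]


@dataclass
class GeneralPop:
    """Hypothetical 'general' population record; missing pieces become None."""
    name: Optional[str]
    marker_info: Optional[str]
    genotypes: List[Genotype]


def from_genepop_like(populations: List[List[Genotype]]) -> List[GeneralPop]:
    # Toy GenePop-style input: no population names, no marker content.
    return [GeneralPop(None, None, list(g)) for g in populations]


def from_arlequin_like(
    populations: Dict[str, Tuple[str, List[Genotype]]]
) -> List[GeneralPop]:
    # Toy Arlequin-style input: named populations with partial marker info.
    return [GeneralPop(name, info, list(g)) for name, (info, g) in populations.items()]


def label(pop: GeneralPop, index: int) -> str:
    # Every consumer of the general record needs a fallback like this one.
    return pop.name if pop.name is not None else f"pop{index + 1}"


gp = from_genepop_like([[(1, 2)], [(2, 2)]])
arl = from_arlequin_like({"North": ("ACGT", [(1, 1)])})
```

Even in this tiny case, any code using GeneralPop has to check for None
and invent fallbacks, which is the kind of complexity a format-specific
record avoids.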