[BioPython] calculate F-Statistics from SNP data

Thu Oct 23 11:10:51 EDT 2008

On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Iterators are more difficult to implement for Ped files, because in this
> format every line of the file is an individual, so to write an iterator
> which iterates by population we will need to read at least the first column of
> every line of the whole file.

GenePop works population by population. What I am getting at is that
different formats might call for completely different strategies.
I've used a strategy with the FDist parser that might be interesting to you:
1. Read the FDist file
2. Convert it to GenePop
3. Do all operations in the GenePop format
4. Convert back if necessary

This might not work in your case because the ped format seems to be
more informative than the genepop format (and thus you would lose
information in the conversion process). Feel free to copy and adapt my
code for your own use (like split_in_pops and split_in_loci).
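
To give a rough idea of what splitting a record by population or by
locus means, here is a minimal, self-contained sketch. It is an
illustration only and not the actual Bio.PopGen code: the PopRecord
class, its layout, and the signatures of the two helpers are assumptions
made up for this example. Only the names split_in_pops and split_in_loci
are taken from my code.

```python
from dataclasses import dataclass, field


@dataclass
class PopRecord:
    """Hypothetical GenePop-like record (not Biopython's real class).

    loci        -- list of locus names
    populations -- list of populations; each population is a list of
                   (individual_name, [genotype_per_locus]) tuples
    """
    loci: list
    populations: list = field(default_factory=list)


def split_in_pops(record, pop_names):
    """Return a dict mapping each population name to a one-population record."""
    if len(pop_names) != len(record.populations):
        raise ValueError("need exactly one name per population")
    return {
        name: PopRecord(list(record.loci), [pop])
        for name, pop in zip(pop_names, record.populations)
    }


def split_in_loci(record):
    """Return a dict mapping each locus name to a one-locus record."""
    result = {}
    for index, locus in enumerate(record.loci):
        # Keep every population, but only the genotype at this locus.
        pops = [
            [(name, [genotypes[index]]) for name, genotypes in pop]
            for pop in record.populations
        ]
        result[locus] = PopRecord([locus], pops)
    return result


# Made-up example data: two loci, two populations.
record = PopRecord(
    loci=["Locus1", "Locus2"],
    populations=[
        [("ind1", ["0101", "0202"]), ("ind2", ["0102", "0201"])],
        [("ind3", ["0202", "0101"])],
    ],
)
by_pop = split_in_pops(record, ["north", "south"])
by_locus = split_in_loci(record)
```

Once everything is in one common representation like this, the
operations only have to be written once, whatever the input format was.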

> I would probably use sqlalchemy to interface with this database: this is why
> I would like to implement Population and Individual objects; they will fit
> better with relational mapping.

You can go ahead and suggest formats for Populations and Individuals.
But I strongly suspect that your proposal will be biased towards your
needs (I've suffered the same problem myself). I think that in
biopython the idea is to try to have a solution that is useful to
everybody.

Also, if you want to put some SQL in the core module code, you will
need approval from the maintainers of biopython. They will
send you to the BioSQL people, who will say that it is none of
their business. Been there, done that, no success.

Don't get me wrong, I am not trying to discourage you in any way. But
I think it is better to gain some experience before proposing changes
to core concepts.
I've been doing this work for 3 years now, and I am convinced that it
would be very hard for me to suggest a good representation for
populations and individuals. Even populations are very hard to address
(for example, some data is geo-referenced, which is called landscape
genetics, and the more traditional data is not).

My suggestion: solve your problem the best way you can (e.g., write an
independent PED parser - you can use any of my code if you want).
Solve small problems, one after another.
Trying to solve the general problem is very hard and requires lots of
long-term experience.
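
As a starting point, a small independent PED reader could look something
like the sketch below. It assumes the usual whitespace-separated PED
layout (family ID, individual ID, father, mother, sex, phenotype, then
two alleles per marker) and it assumes that the first column is what you
want to use as the population label. The function names are made up for
this example. Because individuals of the same population can be
scattered through the file, it reads every line before returning
anything.

```python
from collections import OrderedDict


def parse_ped_line(line):
    """Split one PED line into (family_id, individual_id, genotypes).

    genotypes is a list of (allele1, allele2) tuples, one per marker.
    """
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("malformed PED line: %r" % line)
    family_id, individual_id = fields[0], fields[1]
    alleles = fields[6:]
    # Alleles come in consecutive pairs, one pair per marker.
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return family_id, individual_id, genotypes


def group_by_population(lines):
    """Group individuals by the first column (assumed population label).

    Returns an OrderedDict: label -> list of (individual_id, genotypes),
    in order of first appearance in the input.
    """
    populations = OrderedDict()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        family_id, individual_id, genotypes = parse_ped_line(line)
        populations.setdefault(family_id, []).append((individual_id, genotypes))
    return populations


# Made-up example data; with a real file, pass the open file handle instead.
sample_lines = [
    "POP1 ind1 0 0 1 1 A A G T",
    "POP2 ind2 0 0 2 1 A C G G",
    "POP1 ind3 0 0 1 2 C C T T",
]
pops = group_by_population(sample_lines)
for label, individuals in pops.items():
    print(label, [name for name, _ in individuals])
```

This only solves the small problem of reading the file and grouping by
population; statistics can then be built on top of it one step at a time.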