[BioPython] calculate F-Statistics from SNP data

Thu Oct 23 16:25:29 UTC 2008

On Thu, Oct 23, 2008 at 5:10 PM, Tiago Antão <tiagoantao at gmail.com> wrote:

> On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
> > Iterators are more difficult to implement for Ped files, because in this
> > format every line of the file is an individual, so to write an iterator
> > that iterates by population we would need to read at least the first
> > field of every line in the whole file.
>
> GenePop works population by population. What I am getting at is that
> different formats might need completely different strategies.
> I've used a strategy with the FDist parser that might be of interest to
> you:
> 1. I read the fdist file
> 2. Convert it to genepop
> 3. do all operations in the genepop format
> 4. convert back if necessary.
>
> This might not work in your case because the ped format seems to be
> more informative than the genepop format (and thus you lose
> information in the conversion process). Feel free to copy and adapt my
> code to your own (like split_in_pops and split_in_loci).
>
>
> > I would probably use sqlalchemy to interface with this database: this is
> > why I would like to implement Population and Individual objects, as they
> > would fit better with relational mapping.
>
> You can go ahead and suggest formats for Populations and Individuals.
> But I strongly suspect that your proposal will be biased towards your
> needs (I've suffered the same problem myself). I think that in
> biopython the idea is to try to have a solution that is useful to
> everybody.
>
> Also, if you want to put some SQL in the core module code, you will
> have to get approval from the maintainers of biopython. They will
> send you to the BioSQL people, who will say that it is none of
> their business. Been there, done that, no success.
>
> Don't get me wrong, I am not trying to discourage you in any way. But
> I think it is better to gain some experience before proposing changes
> to core concepts.
> I've been doing this work for 3 years now, and I am convinced that it
> would be very hard for me to suggest a good representation for
> populations and individuals. Even populations are very hard to address
> (for instance, some data is geo-referenced, which is called landscape
> genetics, while the more traditional kind is not).
>
> My suggestion: solve your problem the best way you can (e.g., do an
> independent PED parser - you can use any of my code if you want).
> Solve small problems, one after another.
> Trying to solve the general problem is very hard and requires lots of
> long-term experience.
>

Well, I agree with you... I don't have any idea on how this problem could be
resolved :).
However I think it would be good to add to biopython at least some
functionality to calculate Fst statistics and parse these file formats, at
least at the level at which BioPerl does.
What if we just translate the same functionalities and copy the population
objects from bioperl into biopython?
I realize that it won't be the perfect solution: in fact, it is the same
reason I started this discussion here. The bioperl code wasn't optimized
enough for what I want to do, but I didn't know how to modify perl modules
and preferred python.
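
To make the Fst part concrete, here is a rough sketch of the simplest
textbook form of the statistic for one biallelic SNP, Fst = (HT - HS) / HT,
computed from per-population allele frequencies. This is only an
illustration: it is not the BioPerl or Bio.PopGen implementation, the
function name fst_biallelic is made up, and it assumes all subpopulations
have equal weight (no sample-size correction).

```python
def fst_biallelic(freqs):
    """Return Fst = (HT - HS) / HT for a single biallelic SNP.

    freqs: frequency of one allele in each subpopulation.
    Assumption: subpopulations are weighted equally.
    """
    if not freqs:
        raise ValueError("need at least one subpopulation")
    # Mean allele frequency across subpopulations.
    p_bar = sum(freqs) / len(freqs)
    # Expected heterozygosity of the pooled population.
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        # Monomorphic overall: Fst is undefined; 0.0 is returned by
        # convention in this sketch.
        return 0.0
    # Mean expected heterozygosity within subpopulations.
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_t - h_s) / h_t
```

Identical frequencies in all populations give 0, and populations fixed for
different alleles give 1.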

Maybe we can just write PED and GenePop parsers and let them work with
GenePop and your modules to calculate Fst.
We should agree on a population object that could be used as input for
GenePop.
I think it would be good anyway to release even incomplete code to the
public, because it could be useful for other people.
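
As a possible starting point for the PED side, here is a minimal sketch
that reads PED lines and groups individuals by population. It assumes the
usual six leading columns (family ID, individual ID, paternal ID, maternal
ID, sex, phenotype) followed by two alleles per marker, and it assumes the
first column is used as the population label. The names Individual and
read_ped_by_population are hypothetical, not existing biopython code. Note
that it reads the whole input rather than iterating population by
population, for the reason discussed above.

```python
from collections import namedtuple

# Hypothetical minimal record: an individual's ID and its genotypes,
# one (allele1, allele2) tuple per marker.
Individual = namedtuple("Individual", "ind_id genotypes")


def read_ped_by_population(lines):
    """Group PED records by the first column, used here as population ID."""
    populations = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        if len(fields) < 6 or (len(fields) - 6) % 2:
            raise ValueError("malformed PED line: %r" % line)
        pop_id, ind_id = fields[0], fields[1]
        alleles = fields[6:]
        # Pair up consecutive alleles: columns 7-8 are marker 1, and so on.
        genotypes = list(zip(alleles[0::2], alleles[1::2]))
        populations.setdefault(pop_id, []).append(
            Individual(ind_id, genotypes))
    return populations
```

Usage would be something like read_ped_by_population(open("data.ped")),
where data.ped is a placeholder file name; the result maps each population
label to its list of individuals.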

-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it