[BioPython] calculate F-Statistics from SNP data

Peter biopython at maubp.freeserve.co.uk
Wed Oct 22 17:26:07 UTC 2008


On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
>
> Iterators are more difficult to implement for Ped files, because in this
> format every line of the file is an individual, so to write an iterator
> which iterates by population we will need to read at least the first
> column of every line in the whole file.

It sounds like for Ped files it would make more sense to iterate over
the individuals.  The mental picture I have in mind is a big
spreadsheet, individuals as rows (lines), populations (and other
information) as columns.  By having the parser iterate over the
individuals one by one, the user could then "simplify" each individual
as they are read in, recording in memory just the interesting data.
This way the whole dataset need not be kept in memory.
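
To make the idea concrete, here is a minimal sketch of such a
per-individual iterator.  This is not existing Biopython code: the
function names are hypothetical, and it assumes a whitespace-separated
Ped layout with six leading columns (the first being the family or
population label, the second the individual ID), followed by two
alleles per marker, with "0" meaning a missing allele.

```python
import io
from collections import Counter, defaultdict


def iter_ped_individuals(handle):
    """Yield (population, individual_id, genotypes) for each Ped line.

    Hypothetical helper; assumes six leading columns, then allele pairs.
    """
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        population, individual_id = fields[0], fields[1]
        alleles = fields[6:]
        # Pair consecutive alleles: (marker1_a, marker1_b), (marker2_a, ...)
        genotypes = list(zip(alleles[0::2], alleles[1::2]))
        yield population, individual_id, genotypes


def allele_counts_by_population(handle, marker_index=0):
    """Keep only per-population allele counts for one marker in memory."""
    counts = defaultdict(Counter)
    for population, _individual_id, genotypes in iter_ped_individuals(handle):
        # "Simplify" the individual: record just the alleles of interest,
        # ignoring missing alleles (assumed to be coded as "0").
        counts[population].update(
            allele for allele in genotypes[marker_index] if allele != "0"
        )
    return counts


# Made-up example data, two markers per individual.
example = io.StringIO(
    "pop1 ind1 0 0 1 0 A A G T\n"
    "pop1 ind2 0 0 2 0 A C G G\n"
    "pop2 ind3 0 0 1 0 C C T T\n"
)
counts = allele_counts_by_population(example, marker_index=0)
```

Only one line is held at a time, and what accumulates is the small
dictionary of counts rather than the full genotype table.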

> I was also thinking of starting to use a database to store data, instead of
> files. This would probably solve the problem of running out of memory when
> parsing those long files.
> I would probably use sqlalchemy to interface with this database: this is why
> I would like to implement Population and Individual objects, as they will fit
> better with relational mapping.

That would mean adding sqlalchemy as another (optional) dependency for
Biopython.  If you could use MySQLdb instead that would be better as
several existing modules use this.  However, I would encourage you to
avoid any database if possible because this makes the installation
much more complicated for the end user, and imposes your own arbitrary
schema as well.  It also makes setting up suitable unit tests a
pain.

Peter


