[BioPython] [PopGen] HapMap

Fri Nov 14 07:16:20 EST 2008

> For this particular bit of HapMap code, do you need persistence?  If
> all you need is an on the fly database there may be other options
> (maybe sqlite - some versions of python ship with this).

Considering that there are now more people here that seem to be
interested in this, maybe this can be discussed.

The HapMap is a fairly big database of SNPs taken for 3 (or 4, depending
on how you count) human populations. The database is available in text
format. If I recall correctly (this is old code and old work) there is a
file per chromosome and per population with a (big) list of SNPs. Actually
there are several kinds of files, from allele counts to haplotype
reconstruction. The problem is, if you want to search by a certain
criterion (say a SNP ID, a chunk of a chromosome, or whatever), going
through the text files is a painfully slow process.

My (now very old) implementation (which, I think, is in git) downloads
the text files, loads them into a local SQLite database, indexes it
and exposes a fast interface. The code is actually quite agile, making
life quite easy when downloading and manipulating data, at least in my
opinion.
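
To make the load/index/query idea concrete, here is a minimal sketch. It
is not the actual code: the four-column line layout, the table and index
names, and the functions build_db, snps_in_region and snp_by_id are all
made up for illustration, and the real HapMap files have more columns
than this. It only uses the sqlite3 module from the standard library.

```python
import sqlite3

# Hypothetical, simplified input: SNP id, chromosome, position, population.
# This layout is an assumption, not the real HapMap file format.
SAMPLE_LINES = [
    "rs0001 chr1 1000 CEU",
    "rs0002 chr1 2500 CEU",
    "rs0003 chr1 9000 YRI",
    "rs0004 chr2 1500 CEU",
]


def build_db(lines, path=":memory:"):
    """Load whitespace-separated SNP lines into SQLite and index them."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE snp (rsid TEXT, chrom TEXT, pos INTEGER, pop TEXT)"
    )
    rows = []
    for line in lines:
        rsid, chrom, pos, pop = line.split()
        rows.append((rsid, chrom, int(pos), pop))
    conn.executemany("INSERT INTO snp VALUES (?, ?, ?, ?)", rows)
    # Create the indexes after the bulk insert, one per search criterion.
    conn.execute("CREATE INDEX idx_snp_rsid ON snp (rsid)")
    conn.execute("CREATE INDEX idx_snp_region ON snp (chrom, pos)")
    conn.commit()
    return conn


def snps_in_region(conn, chrom, start, end):
    """Return (rsid, pos, pop) for SNPs within [start, end] on chrom."""
    cur = conn.execute(
        "SELECT rsid, pos, pop FROM snp "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    )
    return cur.fetchall()


def snp_by_id(conn, rsid):
    """Return all rows (one per population) for a given SNP id."""
    cur = conn.execute(
        "SELECT rsid, chrom, pos, pop FROM snp WHERE rsid = ?", (rsid,)
    )
    return cur.fetchall()


conn = build_db(SAMPLE_LINES)
region_hits = snps_in_region(conn, "chr1", 500, 3000)
id_hits = snp_by_id(conn, "rs0004")
```

Passing a file name instead of ":memory:" would make the database
persistent, so the download and load steps only need to happen once.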

If there is interest here, I can pull out my code and we can discuss
the approach that I followed in the past. Also, if somebody else wants
to take the lead on this, go ahead (you can still use my code). To be
honest I would prefer to have a shared discussion on this, rather than just
submitting the code alone, with only my own reasoning to back it.