[BioPython] Biopython and population genetics

14 Sep 2001 19:45:26 -0700

Hi folks,

Let me introduce myself: I'm a recently converted Pythoneer/Pythonista
working in computational biology/bioinformatics in several different
areas, but most recently focused on bioinformatics in the context of
population genetics analysis.

As part of an NIH-funded study we are currently implementing a
Python-based framework for parsing population genetics data files
and generating statistics from them, such as linkage disequilibrium
and Hardy-Weinberg tests.
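To make the Hardy-Weinberg part concrete, here is a minimal sketch of a
chi-square goodness-of-fit test for a single locus.  It is only an
illustration, not our framework's code: the function name
hardy_weinberg_chi_square and the (allele, allele) tuple input are
assumptions made for this example, and the p-value lookup is left out.

```python
from collections import Counter
from itertools import combinations_with_replacement


def hardy_weinberg_chi_square(genotypes):
    """Return (chi_square, degrees_of_freedom) for one locus.

    `genotypes` is a list of (allele, allele) pairs, one per individual.
    This is a hypothetical helper written for illustration only.
    """
    n = len(genotypes)
    if n == 0:
        raise ValueError("need at least one genotype")

    # Allele frequencies are estimated from the 2n observed alleles.
    allele_counts = Counter(a for pair in genotypes for a in pair)
    freqs = {a: c / (2 * n) for a, c in allele_counts.items()}

    # Genotypes are unordered, so ("A", "B") and ("B", "A") are pooled.
    observed = Counter(tuple(sorted(pair)) for pair in genotypes)

    chi_square = 0.0
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        # Expected count: n*p^2 for homozygotes, 2*n*p*q for heterozygotes.
        expected = n * freqs[a] * freqs[b] * (1 if a == b else 2)
        chi_square += (observed[(a, b)] - expected) ** 2 / expected

    k = len(freqs)
    return chi_square, k * (k - 1) // 2


# Four individuals in exact Hardy-Weinberg proportions (1 : 2 : 1).
balanced = [("A", "A"), ("A", "B"), ("B", "A"), ("B", "B")]
balanced_chi, balanced_df = hardy_weinberg_chi_square(balanced)

# Four heterozygotes: an excess relative to the expected 1 : 2 : 1.
all_het = [("A", "B")] * 4
het_chi, het_df = hardy_weinberg_chi_square(all_het)
```

With samples this small the chi-square approximation is poor, so the
numbers above only show the mechanics of the calculation.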

At this stage it is not 100% certain that we will be able to release
this as an open source effort, although it is more than likely that we
will be able to.  Our university has a license policy that most
closely resembles the XFree86-style license (i.e. BSD without the
"advertising clause").

A typical "population genetics" data file consists of multi-locus
genotype data: a series of lines, one per individual, each holding a
tab- or space-delimited series of loci (with two alleles per
locus), e.g.:

Individual_ID  Locus_A    Locus_B  Locus_C

10323          A32  A12   B3  B4   C9 C10
10324          A2   A12   B5  B1   C1 C10

... etc.
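A minimal sketch of how such a file might be read, assuming the first
non-blank line is a header naming the loci and every later line holds
an ID followed by exactly two alleles per locus.  The names Genotype
and parse_genotypes are made up for this example and are not part of
biopython.

```python
from collections import namedtuple

# Hypothetical record type: an individual ID plus a dict that maps
# each locus name to its pair of alleles.
Genotype = namedtuple("Genotype", ["individual_id", "loci"])


def parse_genotypes(lines):
    """Parse whitespace-delimited multi-locus genotype lines."""
    # split() with no argument handles both tabs and runs of spaces.
    rows = [line.split() for line in lines]
    rows = [fields for fields in rows if fields]  # drop blank lines
    if not rows:
        return []

    locus_names = rows[0][1:]  # header minus the ID column
    records = []
    for fields in rows[1:]:
        individual_id, alleles = fields[0], fields[1:]
        if len(alleles) != 2 * len(locus_names):
            raise ValueError(
                "expected %d alleles for individual %s, got %d"
                % (2 * len(locus_names), individual_id, len(alleles))
            )
        loci = {
            name: (alleles[2 * i], alleles[2 * i + 1])
            for i, name in enumerate(locus_names)
        }
        records.append(Genotype(individual_id, loci))
    return records


sample = """\
Individual_ID  Locus_A    Locus_B  Locus_C

10323          A32  A12   B3  B4   C9 C10
10324          A2   A12   B5  B1   C1 C10
"""

records = parse_genotypes(sample.splitlines())
```

Keeping the alleles as opaque strings means the same code works whether
they are allele calls or full sequences.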

The alleles ("A32", "A12", "B3", etc.) can sometimes consist of
high-resolution data (e.g. actual DNA or RNA sequences) or
lower-resolution data such as "DQP1003".  The latter are simply
nomenclature strings, often termed "allele calls": the molecular
typing "kits" used on large numbers of individuals aren't as accurate
as full sequencing.  At least that's my understanding of the in-vitro
aspect, being a strictly in-silico person myself ;-).

Unfortunately, unlike the NCBI/GenBank and other formats, there are no
generally accepted formats for pop-gen multi-locus data for many
individuals (please correct me if I'm wrong).  There are, however, a
few attempts at standardisation, most notably the PGDB project at the
University of Louisiana:

 http://seahorse.louisiana.edu/PGDB/

which has attempted a tentative XML file-format standard.

I'm currently implementing a parser for the fairly simple-minded but
reasonably common format described above, but so far it is
independent of biopython.  Ideally, it would be a subclass of
"AbstractParser" or somesuch (and would ultimately recognise the XML
version, when one hopefully becomes standard), and could return
appropriate Sequence or SeqFeature objects.

If not all of our code can be released, then at the very least
perhaps some of our modules could be contributed directly to biopython
or be made available as "optional" add-ons to biopython.

In any case, I have been investigating biopython and would like to get
some feedback from the community: 

* pointers to existing work in Python (or elsewhere) along these lines
  (if there is any);

* suggestions about how such parsing modules might be most elegantly
  integrated into the biopython framework;

* comments from others who deal in pop. genetics analysis and have
  experience and/or suggestions for data formats in pop. genetics and
  other multi-locus genotype data, and whether there is sufficient
  interest in such a parser to warrant the effort to generalise it.

Thanks,

Alex