[Biopython-dev] Biopython 1.60 plans and beyond

Tiago Antão tiagoantao at gmail.com
Sat Feb 18 05:53:37 EST 2012


On Sat, Feb 18, 2012 at 9:39 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> So, what other cool things are you all working on,
> and in particular what is ready or near-ready for
> inclusion with Biopython this year?

I changed jobs 3 months ago, and that has meant that I have been in
a hell-hole of over-work for the last 3 months. A hell-hole that I have
now crawled out of (and with lots of new code written). I am now doing
(more standard?) human evolutionary genetics.

My previous experience with donating code to Biopython has left me with
mixed feelings: people use the applications a lot but very rarely the
code directly (a cursory look at the citations of the applications vs
the citations of Bio.PopGen clearly shows that).

I have now written a LOT of code in slightly different areas, which
might (or might not) interest people:

1. Phasing/imputation: code to parse/convert between Beagle, Shapeit,
phase, impute2. Typically to analyse SNP chips of human data (say
between 0.5 and 1.5 Million SNPs per individual, 1000s of

2. plink: code to parse plink output files (quite trivial). Do people
use plink a lot?

3. GO code: I would REALLY like to start a discussion on what should
be a proper GO approach. In my case I am doing gene enrichment
analysis. I might start a thread or a blog post on this...

4. Code to do multi-tasking. Actually Bio.PopGen has a scheduler to run
multiple (external) tasks at the same time, but I have written a new
one. Maybe the code does not belong in Biopython, but a discussion
could be had around such an issue (I suppose people doing analysis of
lots of data have been having that problem, not just me?).

5. Some Ensembl variation code: things like getting the ancestral SNP
(versus the derived) or getting all the stuff (genes mostly) within a
certain window of the genome.
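
To make point 4 concrete, here is a minimal sketch of running several
external tasks at the same time with a cap on how many run at once. It is
not the Bio.PopGen scheduler nor the new one mentioned above; it assumes a
modern Python 3 and uses only the standard library. The names run_external
and run_all, and the example jobs, are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_external(cmd):
    """Run one external command (a list of arguments); return (cmd, return code)."""
    completed = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return cmd, completed.returncode


def run_all(commands, max_workers=4):
    """Run external commands concurrently, at most max_workers at a time.

    Threads are enough here because each worker just waits on a child
    process; results come back in the same order as the input commands.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_external, commands))


# Hypothetical jobs: stand-ins for real external programs.
jobs = [[sys.executable, "-c", "import sys; sys.exit(%d)" % i] for i in range(3)]
for cmd, returncode in run_all(jobs, max_workers=2):
    print(returncode, " ".join(cmd))
```

A real scheduler would also want logging, retries and some way to spread
jobs over several machines; none of that is attempted here.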

On a side note, I still have to do a 64-bit Windows port (remember?).
This will have to be done from home, as my current work computer is a
Pentium 4 (not exactly a modern 64-bit machine ;) )

Another issue that has been crossing my mind regards the inclusion of
new code: in my case I would really like to have something like a
"beta" version of the API, i.e. releasing something that is deemed
"unstable" API-wise (to get comments from the community) and then
stabilizing it. Concretely: in the first/second version people should
expect the API to not be stable and to change.
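
One possible way to signal such a "beta" API, sketched here purely as an
illustration: mark the new functions so that they warn when called. The
warning class ExperimentalAPIWarning, the decorator experimental and the
function parse_records are all hypothetical names, not anything that
exists in Biopython as far as this message is concerned.

```python
import functools
import warnings


class ExperimentalAPIWarning(Warning):
    """Hypothetical warning category for code whose API may still change."""


def experimental(func):
    """Decorator: warn on every call that func's API is not yet stable."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            "%s is experimental and its API may change" % func.__name__,
            ExperimentalAPIWarning,
            stacklevel=2,  # point the warning at the caller, not the wrapper
        )
        return func(*args, **kwargs)

    return wrapper


@experimental
def parse_records(lines):
    """Hypothetical 'beta' parser: split each line into whitespace-separated fields."""
    return [line.split() for line in lines]
```

Users who accept the risk can silence the message with
warnings.simplefilter("ignore", ExperimentalAPIWarning); everybody else is
told, at the point of use, that the interface may still move.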

Another side issue that I would like to discuss (maybe in a different
thread) is how people are coping with large amounts of data using
Python (or Perl/Ruby for that matter), specifically the problem of
performance. As I see it, we increasingly depend on external (fast)
programs, C library extensions or Java extensions to
do the bulk of the work. Inner loops in Python simply do not cut for

In the near future (this year) I will probably also be working with
sequence data (BAM and VCF stuff might resurface).

All for now,

"If you want to get laid, go to college.  If you want an education, go
to the library." - Frank Zappa
