[Biopython-dev] Biopython 1.60 plans and beyond

Tiago Antão tiagoantao at gmail.com
Sat Feb 18 10:53:37 UTC 2012


Hello,

On Sat, Feb 18, 2012 at 9:39 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> So, what other cool things are you all working on,
> and in particular what is ready or near-ready for
> inclusion with Biopython this year?

I changed jobs 3 months ago, and that has meant that I have been in
a hell-hole of overwork ever since. A hell-hole that I have now
crawled out of (with lots of new code written). I am now doing
(more standard?) human evolutionary genetics.

My previous experience with donating code to Biopython has left me with
mixed feelings: people use the applications a lot but very rarely the
code directly (a cursory look at the citations of the applications vs
the citations of Bio.PopGen clearly shows that).

I have now written a LOT of code in slightly different areas, which
might (or might not) interest people:

1. Phasing/imputation: code to parse/convert between Beagle, Shapeit,
phase, impute2. Typically used to analyse human SNP chip data (say
between 0.5 and 1.5 million SNPs per individual, 1000s of
individuals).

2. plink: code to parse plink output files (quite trivial). Do people
use plink a lot?

3. GO code: I would REALLY like to start a discussion on what should
be a proper GO approach. In my case I am doing gene enrichment
analysis. I might start a thread or a blog post on this...

4. Code to do multi-tasking. Actually Bio.PopGen has a scheduler to run
multiple (external) tasks at the same time, but I have written a new
one. Maybe the code does not belong in Biopython, but we could have a
discussion around this issue (I suppose other people analysing lots of
data have had this problem, not just me?).

5. Some Ensembl variation code: things like getting the ancestral SNP
allele (versus the derived one) or getting all the stuff (genes mostly)
within a certain window of the genome.
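
To make item 4 concrete, here is a minimal sketch of running several
external tasks at the same time using only the Python standard library.
It is neither the existing Bio.PopGen scheduler nor the new one
mentioned above; the names run_task and run_all and the toy commands
are assumptions made up for illustration, and it assumes a recent
Python 3 (subprocess.run with capture_output).

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_task(cmd):
    """Run one external command; return (cmd, exit code, stripped stdout)."""
    done = subprocess.run(cmd, capture_output=True, text=True)
    return cmd, done.returncode, done.stdout.strip()


def run_all(commands, max_workers=4):
    """Run external commands with at most max_workers active at once.

    Threads are enough here because each worker only waits on a child
    process. Results come back in the same order as `commands`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, commands))


# Hypothetical stand-ins for real external programs: each "task" is a
# small Python child process that prints a square.
commands = [[sys.executable, "-c", f"print({n} * {n})"] for n in range(5)]
results = run_all(commands, max_workers=2)
```

Each entry in results holds the command, its exit code, and its
captured output, so failed tasks can be spotted and re-run. A real
scheduler would probably also need logging, timeouts, and some way to
spread work across machines, none of which is shown here.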

On a side note, I still have to do a 64 bit Windows port (remember?).
This will have to be done from home, as my current work computer is a
Pentium 4 (not precisely a modern 64 bit machine ;) )

Another issue that has been crossing my mind regards the inclusion of
new code: in my case I would really like to have something like a
"beta" version of the API, i.e. releasing something that is deemed
"unstable" API-wise (to get comments from the community) and then
stabilizing it. Concretely: in the first/second version people should
expect the API not to be stable and to change.

Another side issue that I would like to discuss (maybe in a different
thread): how are people coping with large amounts of data using
Python (or Perl/Ruby for that matter)? Specifically the problem of
performance. As I see it, it is more and more the case that we depend
on external (fast) programs or C extensions or Java extensions to
do the bulk of the work. Inner loops in Python simply do not cut it
for speed.
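
A toy illustration of the inner-loop point, as a sketch only: both
functions below count G and C bases, but the first loops over the
sequence in Python while the second hands the loop to str.count, which
is implemented in C in CPython. The function names and the test
sequence are made up for this example.

```python
def gc_count_loop(seq):
    """Count G and C bases with an explicit Python-level inner loop."""
    total = 0
    for base in seq:
        if base == "G" or base == "C":
            total += 1
    return total


def gc_count_builtin(seq):
    """Same count, but the scanning happens inside str.count."""
    return seq.count("G") + seq.count("C")


# Toy sequence; real data would be orders of magnitude larger.
seq = "ACGTGGCCAT" * 1000
assert gc_count_loop(seq) == gc_count_builtin(seq)
```

Timing the two with the standard timeit module on a large input would
be one way to see how much the interpreted loop costs on a given
machine.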

In the near future (this year) I will probably also be working with
sequence data (BAM and VCF stuff might resurface).

All for now,
T

-- 
"If you want to get laid, go to college.  If you want an education, go
to the library." - Frank Zappa



