[Biopython] Why so few recipes in the cookbook?

Fri Dec 18 15:00:13 UTC 2009

2009/12/18 Daniel Silvestre <daniel at dim.fm.usp.br>:
> Hi people,
>
> Actually, even the tutorial is a collection of snippets. I do recognize
> and appreciate the effort. But in order to attract biologists like myself
> and my colleagues, we need something more pragmatic and
> problem-driven.

Most of the tutorial is by its nature "snippets", but the Cookbook
chapter examples are more free-standing. I suspect you are
looking for even more self-contained things: complete examples
with a motivating rationale, sample input data, etc.

> The prototypical workflow of a molecular biologist is:
>
>  - Select a bunch of interesting genes in Entrez by clicking buttons and
> boxes;
>
>  - BLAST some sequences and save the results in separate directories,
> normally one for each gene;
>
>  - Struggle to extract useful statistics from the results, which usually
> ends in sorting and selecting the first few results;
>
>  - Apply some analytical method (phylogeny reconstruction, mutation
> analysis, etc.) over the "filtered" results;
>
>  - Restart the cycle until satisfied or bored.
>
> By the way, in one of my classes I just taught the students (who can
> be grad students and professors) to use the fields of Entrez (molecular
> weight, range search, organism name, etc.) and they felt really powerful
> after that. For instance, they used to retrieve the sequence lists from
> papers by hand!!!

I confess I don't know or use the full power of the Entrez website,
although that is in part because I can do clever stuff via their API ;)

> On the other hand, the ones who dare to use biopython typically don't
> know how to glob things and other administrivia. So, without a real
> example only biology geeks like me get to the next step.
>
> There is a list of good recipes to start the cookbook:
>
>  - How to retrieve and organize sequences and annotations from online
> databases using your own custom command-line tool;

We touch on some of this already, e.g. search and retrieve examples
in the Bio.Entrez chapter of the tutorial. Are you looking for something
more in depth? Or using other databases?
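For reference, the search-and-retrieve pattern I have in mind looks
roughly like the sketch below. It combines Entrez field tags (organism,
gene name, sequence length range) into one search term and then fetches
the matching GenBank records with Bio.Entrez and Bio.SeqIO. The helper
names build_term and fetch_records are made up for this email, and the
Biopython import sits inside the function only so the term-building part
can be tried without Biopython or a network connection.

```python
def build_term(organism, gene, min_len, max_len):
    """Combine Entrez field tags into a single search term."""
    return f"{organism}[Organism] AND {gene}[Gene] AND {min_len}:{max_len}[SLEN]"


def fetch_records(term, email, retmax=10):
    """Search the NCBI nucleotide database and return the matching records.

    Hypothetical helper: needs Biopython and network access.
    """
    # Imported here so build_term() works without Biopython installed.
    from Bio import Entrez, SeqIO

    Entrez.email = email  # NCBI asks for a contact address

    # Step 1: search, which only returns a list of identifiers.
    handle = Entrez.esearch(db="nucleotide", term=term, retmax=retmax)
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    if not ids:
        return []

    # Step 2: fetch the records themselves in GenBank format.
    handle = Entrez.efetch(
        db="nucleotide", id=",".join(ids), rettype="gb", retmode="text"
    )
    records = list(SeqIO.parse(handle, "genbank"))
    handle.close()
    return records
```

Something like fetch_records(build_term("Cypripedioideae", "matK", 500,
2000), email="you@example.org") would then give a list of SeqRecord
objects to loop over. The organism, gene and length range are only
illustrative values.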

>  - How to set up/insert/retrieve a bunch of results into a local
> (personal) database (SQL);

Done, although not tagged as a cookbook specifically:
http://www.biopython.org/wiki/BioSQL

(The tutorial also points to this page)

>  - How to annotate retrieved results with your own results;

Now here I'd like a little clarification about what you want to do.

My guess would be something I have considered working up
into a cookbook recipe, based on stuff I have already done:
Taking a small genome (viral or prokaryote), doing simple
gene predictions (e.g. ORF finding, pick first start codon,
or maybe calling a command line tool to do it for us), then
taking the predicted peptides and BLASTing them, then
making a GenBank file with these predicted features and
sticking a summary of the BLAST results in their annotation.

However, while this is a reasonable first step, there are
downsides to encouraging this sort of naive approach to
annotation - the example would ideally have a "Further
Reading" section, see for example Schnoes et al. 2009:
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1000605
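To make the "ORF finding, pick first start codon" step of that recipe
concrete, here is a minimal sketch. It is deliberately plain Python so
it runs without any dependencies, and it makes several simplifying
assumptions that a real recipe would have to revisit: forward strand
only, ATG as the only start codon, and the three standard stop codons.
The function name find_orfs and the min_codons cutoff are made up for
illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=30):
    """Return sorted (start, end) pairs for forward-strand ORFs.

    Coordinates are 0-based with an exclusive end, and each ORF runs
    from the first ATG in its frame up to and including the stop codon.
    ORFs shorter than min_codons codons (stop included) are skipped.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)
```

In a full recipe each (start, end) pair would presumably become a
feature on the genome record before writing the GenBank file, with the
reverse strand handled by running the same search on the reverse
complement.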

> These are real problems faced by the common biologist. The proposed
> snippets in the tutorial and the cookbook are already handled by a lot of
> web tools. It's absolutely necessary to show that biopython can increase
> the power and range of a biologist's everyday work, and that this work
> can possibly be automated.
>
> I have some examples for obtaining statistics over genome sequences which
> are complete (including globbing file lists, retrieving from
> online databases, etc.) and can prepare them as a recipe. But I could
> use some help . . .

If you start a cookbook entry on the wiki, with some outline
code, I'm sure we can as a group contribute ideas and tips
(particularly in the code, but maybe in the approach too). Or,
if you would rather, discuss some specific ideas here on the
mailing list first.
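As a possible starting point for that outline code, here is a sketch of
the "one directory per gene, keep the first few hits" step from your
workflow, including the file globbing. It assumes things your email does
not specify: that the BLAST results were saved as standard 12-column
tabular output (the BLAST+ "-outfmt 6" layout), in files ending .tsv,
inside one sub-directory per gene. The function name top_hits is made
up, and only the standard library is used.

```python
import csv
from pathlib import Path


def top_hits(results_dir, n=5):
    """Map each gene directory name to its n best hits by bit score.

    Each hit is a (subject_id, evalue, bitscore) tuple, taken from
    columns 2, 11 and 12 of the tabular BLAST output.
    """
    best = {}
    for gene_dir in sorted(Path(results_dir).iterdir()):
        if not gene_dir.is_dir():
            continue
        hits = []
        # Glob every result table saved for this gene.
        for table in sorted(gene_dir.glob("*.tsv")):
            with table.open(newline="") as handle:
                for row in csv.reader(handle, delimiter="\t"):
                    if not row or row[0].startswith("#"):
                        continue  # skip blank and comment lines
                    hits.append((row[1], float(row[10]), float(row[11])))
        # Highest bit score first, then keep only the first n.
        hits.sort(key=lambda hit: hit[2], reverse=True)
        best[gene_dir.name] = hits[:n]
    return best
```

Calling top_hits("results", n=3) on such a layout would give a
dictionary keyed by gene name, which is easy to print as a summary
table or pass on to the next analysis step.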

Note that some of these topics would be ideal for an OBF
project-wide set of examples, with reference solutions in
Biopython, BioPerl, BioJava, BioRuby, etc. That is however
a much, much bigger task.

Peter