[Biopython-dev] Building Gene Ontology support into Biopython

Sun Oct 18 10:34:21 UTC 2009

On Sun, Oct 18, 2009 at 6:22 AM, Chris Lasher <chris.lasher at gmail.com> wrote:
> I have a need to work with the gene ontology (GO) and gene ontology
> annotations (GOAs) for my research. It seems Biopython still lacks GO
> support despite a few threads from several years ago. I'd like to make
> GO support in Biopython a reality now. I would really appreciate any
> help and suggestions.

In terms of missing functionality, it would help me greatly if you
could describe the kind of things you want to achieve (and
therefore how it may or may not need to connect to existing
code like the SeqRecord and SeqFeature objects).

> Bioperl has solid GO support. I don't find their code straightforward
> at all; I haven't picked out what component is responsible for what
> task. Nonetheless, it could provide starting points to build support
> for Biopython.

Yeah - I think Hilmar commented on some of these threads. Doing
ontologies properly is hard work.

> Beyond looking through Bioperl code, though, I have several questions
> and I really welcome suggestions:
>
> 1) First off, does anyone have any gene ontology Python code
>   laying around?

Not quite what you wanted, but Ed Cannon has an OBO to OWL
parser in his GitHub repository; see this thread:
http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006701.html

> 2) What is the Biopython stance on introducing third-party
> dependencies? The gene ontology is represented as a directed acyclic
> graph (DAG) and I want to use an existing graph library rather than
> roll our own. What would be the aversion to requiring either NetworkX
> or igraph as a dependency for the GO library? (I have experience with
> NetworkX and would prefer it, though I imagine igraph would be very
> similar for nearly all the methods we'd need access to to construct
> the DAG)

As Michiel said, we prefer to avoid 3rd party dependencies, *especially*
build time ones. Wrappers for 3rd party command line tools are fine.

Currently we do have a number of optional Python dependencies for
specific functionality - e.g. ReportLab for graphics, and assorted SQL
database backends. The Python library NetworkX may fall into this
category. Adding another dependency should not be done lightly.
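
To make the "optional dependency" idea concrete, here is a rough sketch
(not existing Biopython code) of how a GO module might use NetworkX when
it is installed and fall back to plain Python otherwise. The function
name go_ancestors, the is_a dictionary layout and the term IDs used with
it are all hypothetical, chosen just for illustration.

```python
try:
    import networkx as nx  # optional third-party dependency
except ImportError:
    nx = None  # fall back to the plain Python traversal below


def go_ancestors(is_a, term):
    """Return the set of all terms reachable from term via is_a links.

    is_a is assumed to map a term ID to a list of its direct parent IDs
    (a hypothetical layout, not an existing Biopython structure).
    """
    if nx is not None:
        graph = nx.DiGraph()
        graph.add_node(term)  # make sure the query term is in the graph
        for child, parents in is_a.items():
            graph.add_edges_from((child, parent) for parent in parents)
        # Edges point child -> parent, so the ontology "ancestors" of a
        # term are its descendants in the directed graph.
        return set(nx.descendants(graph, term))

    # Plain Python fallback: iterative depth-first walk up the DAG.
    seen = set()
    stack = list(is_a.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(is_a.get(current, ()))
    return seen
```

Either branch should give the same answer on a DAG; the point is only
that the import is attempted at run time and nothing is needed at build
time.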

> 3) What are parsers written using these days? I checked the tutorial
> section on them
> (http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc209) but
> this wasn't explicitly covered. Any pointers to recently written
> parsers? I seem to recall Biopython has moved away from Martel
> parsers, correct? Has anything been done with pyparsing or some other
> parser, or is it strictly manual now? Also, I'm welcoming tips on the
> architecture of parsers in general.

Martel is gone. Everything is done in plain Python these days.
The coding styles vary - some are scanner/consumer, but using
iterators for large files (returning natural chunks of data in steps)
is normal. For things like XML, there are (several) parsers in the
Python standard library.
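
As an illustration of the iterator style, here is a minimal sketch of a
plain Python generator that walks an OBO-style file and yields one
[Term] stanza at a time. It is not an existing Biopython parser; the
name iter_obo_terms is made up, and it assumes the simple layout of
"tag: value" lines grouped under "[Term]" headers, ignoring escapes,
trailing modifiers and other details of the real format.

```python
def iter_obo_terms(handle):
    """Yield one dict per [Term] stanza, mapping tag -> list of values.

    handle is any iterable of text lines (e.g. an open file), so only
    one stanza needs to be held in memory at a time.
    """
    record = None  # None while outside a [Term] stanza
    for line in handle:
        line = line.strip()
        if line.startswith("["):
            # A new stanza header closes the previous stanza, if any.
            if record is not None:
                yield record
            # Only [Term] stanzas are collected; others are skipped.
            record = {} if line == "[Term]" else None
        elif line and record is not None and ": " in line:
            tag, value = line.split(": ", 1)
            # Drop a trailing " ! comment" such as the parent's name.
            value = value.split(" ! ", 1)[0]
            record.setdefault(tag, []).append(value)
    # Do not forget the last stanza in the file.
    if record is not None:
        yield record
```

Typical use would be a loop such as "for term in iter_obo_terms(handle):"
over an open file, the same pattern as the existing iterator-based
parsers.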

> 4) Tying the GO Annotations to a fundamental Biopython data structure.
> This can't really be a SeqRecord object. SeqRecord.annotations makes
> sense, however, I can't guarantee a SeqRecord object will exist
> because the annotations don't come with the sequence itself. (A
> sequence is required to instantiate a SeqRecord object). Any
> suggestions on this?

Background to the task would help. Note you can create a
SeqRecord without a sequence, but it may not be sensible.
See for example the QUAL file parser which uses the new
UnknownSeq object where we just know the sequence
length.

> 5) BioSQL support. Not having used BioSQL in the past, I'm a bit wary
> of adding this feature, but it is implemented in Bioperl. I haven't
> yet figured out if it's used as the default data store for their
> parsers or if it is only an optional store.

I would describe BioSQL as an optional data store, particularly
suited to holding GenBank or EMBL files. Biopython has
BioSQL support (as do BioJava etc). We follow BioPerl and
use a loose ad-hoc ontology, but the BioSQL schema is
designed to allow proper ontologies. This is something I
have raised on the BioSQL mailing list.

Related to this, EMBOSS have done a lot of work mapping
between the ontologies used in GenBank, EMBL, UniProt
and the standard sequence ontology - something I'm hoping
we may be able to re-use in our planned support for GFF3
files.

Peter