[Biopython-dev] Building Gene Ontology support into Biopython

Sun Oct 18 04:05:10 EDT 2009

--- On Sun, 10/18/09, Chris Lasher <chris.lasher at gmail.com> wrote:
> I'd like to make GO support in Biopython a reality now.

That would be nice.

> Bioperl has solid GO support. I don't find their code
> straightforward at all; I haven't picked out what component is 
> responsible for what task.

To arrive at a good design of a Biopython module, sometimes it helps to write its documentation first, before writing the actual code.

> 2) What is the Biopython stance on introducing third-party
> dependencies?

I think we should avoid them as much as possible. Besides the extra hassle for users and developers, unforeseen changes in a third-party dependency may break your module.

> What would be the aversion to requiring either NetworkX or igraph
> as a dependency for the GO library.

Are these Python modules or C software? Do NetworkX or igraph have their own third-party dependencies? Do we need the full NetworkX or igraph, or just a part of it? In the latter case, assuming that these are open-source software packages, we may simply include the parts we need in Biopython. Also, how far do you get by using NumPy?
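
To get a feeling for how much of a graph library is really needed, here is a minimal sketch of an ontology-like DAG stored in a plain dictionary, with one traversal (collecting all ancestors of a term). The function name, the dictionary layout, and the "T:n" identifiers are all made up for illustration; this is not existing Biopython code, and a real GO module may well need more than this.

```python
# Hypothetical sketch: a GO-like DAG kept in a plain dict, using only
# the standard library. The term IDs below are made-up placeholders.

def ancestors(parents, term):
    """Return the set of all terms reachable from `term` via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Maps each term to the list of its direct parents.
parents = {
    "T:4": ["T:2", "T:3"],
    "T:3": ["T:1"],
    "T:2": ["T:1"],
    "T:1": [],
}

all_parents_of_t4 = ancestors(parents, "T:4")
```

If queries like this cover most of what the GO code needs, a dependency on NetworkX or igraph may not be necessary.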

> 3) What are parsers written using these days?

Current parsers typically work as follows, assuming that a data file contains exactly one record:

>>> handle = open("mydatafile")
>>> from Bio import SomeModule
>>> record = SomeModule.read(handle)
# record is now a SomeModule.Record object

If one data file typically contains multiple records, use a "parse" function to return an iterator:

>>> handle = open("mydatafile")
>>> from Bio import SomeModule
>>> records = SomeModule.parse(handle)
>>> for record in records:
...     # record is now a SomeModule.Record object
...     print(record)
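
On the module side, the two functions can be sketched roughly as follows: "parse" is a generator, and "read" is a thin wrapper around it that insists on exactly one record. The Record class and the "//" record terminator are assumptions made up for this illustration; they do not describe any particular Biopython module.

```python
import io

class Record(object):
    """Hypothetical record: simply the lines of one block of the file."""
    def __init__(self, lines):
        self.lines = lines

def parse(handle):
    """Yield one Record per block; a block ends with a line holding '//'."""
    lines = []
    for line in handle:
        line = line.rstrip("\n")
        if line == "//":
            yield Record(lines)
            lines = []
        elif line:
            lines.append(line)
    if lines:  # final block without a trailing terminator
        yield Record(lines)

def read(handle):
    """Return the single Record in handle; raise ValueError otherwise."""
    records = parse(handle)
    try:
        record = next(records)
    except StopIteration:
        raise ValueError("No records found in handle")
    try:
        next(records)
    except StopIteration:
        return record
    raise ValueError("More than one record found in handle")

# Usage with an in-memory handle standing in for an open file.
handle = io.StringIO(u"a\nb\n//\nc\n//\n")
blocks = [record.lines for record in parse(handle)]
```

Building "read" on top of "parse" keeps the actual parsing logic in one place.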

> Any pointers to recently written parsers?

Bio.SeqIO.read and Bio.SeqIO.parse are good examples. You can also look at Bio.Medline for a simple parser using this approach.

> I seem to recall Biopython has moved away from Martel
> parsers, correct?

Yes.

> Has anything been done with pyparsing or some other
> parser, or is it strictly manual now?

Not as far as I know.

> Also, I'm welcoming tips on the
> architecture of parsers in general.

See above. Also note that few parsers nowadays use Bio.ParserSupport. This was previously used to implement parsers in Biopython (with parsers, scanners, and consumers). I would avoid Bio.ParserSupport and simply write a straightforward parser using the Python standard library.
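
As an illustration of what "straightforward" could mean here, below is a minimal sketch that reads OBO-style stanzas ("[Term]" followed by "tag: value" lines) using nothing but string methods. The function name and the sample text are hypothetical, and the sketch deliberately ignores quoting, escape characters, and trailing modifiers, so take it as a starting point rather than a complete OBO parser.

```python
import io

def parse_stanzas(handle):
    """Yield (stanza_type, tags) for each stanza, e.g. ("Term", {...}).

    tags maps each tag name to a list of values, because tags such as
    is_a may occur more than once. Header lines that appear before the
    first stanza are skipped.
    """
    stanza_type = None
    tags = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # blank line or whole-line comment
        if line.startswith("[") and line.endswith("]"):
            if stanza_type is not None:
                yield stanza_type, tags
            stanza_type = line[1:-1]
            tags = {}
        elif stanza_type is not None and ":" in line:
            tag, value = line.split(":", 1)
            # Simplification: cut at " ! " to drop a trailing comment,
            # without checking whether it sits inside a quoted string.
            value = value.split(" ! ", 1)[0].strip()
            tags.setdefault(tag, []).append(value)
    if stanza_type is not None:
        yield stanza_type, tags

# Made-up sample text in the style of an OBO file (placeholder IDs).
sample = u"""format-version: 1.2

[Term]
id: EX:0000003
name: example child term
is_a: EX:0000001 ! first example parent
is_a: EX:0000002 ! second example parent

[Typedef]
id: part_of
name: part of
"""

stanzas = list(parse_stanzas(io.StringIO(sample)))
```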

> 4) Tying the GO Annotations to a fundamental Biopython data
> structure.
> Any suggestions on this?

A SeqRecord doesn't seem to be appropriate for gene ontology.
How about a Record class specifically for GO?
Also, what should such a class contain?
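
Just to have something concrete to discuss, one possible shape for such a class is sketched below. Every attribute name here is an assumption on my part, chosen to mirror what a GO term usually carries (an identifier, a name, a namespace, a definition, and links to parent terms); none of it is settled.

```python
class Record(object):
    """Hypothetical container for one GO term; all fields are assumptions."""

    def __init__(self, id, name="", namespace="", definition="", parents=None):
        self.id = id                  # term identifier
        self.name = name              # human-readable term name
        self.namespace = namespace    # e.g. "biological_process"
        self.definition = definition  # free-text definition
        # Copy the list so that records never share one mutable default.
        self.parents = list(parents) if parents is not None else []

    def __repr__(self):
        return "Record(%r, %r)" % (self.id, self.name)

# Usage with placeholder identifiers.
term = Record("EX:0000003", name="example child term",
              parents=["EX:0000001", "EX:0000002"])
```

Whether annotations of gene products should live in the same class or in a separate one is a question I would leave open for now.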

Best,

--Michiel.