[Biopython-dev] OBO parser & DAG

Wed Jan 8 21:05:47 UTC 2014

Hi,

I'll answer below (even though I do have a bad habit of top-posting my
answers, sorry).

On Wed, Jan 8, 2014 at 5:55 PM, Iddo Friedberg <idoerg at gmail.com> wrote:

>I wrote about not using networkx as the main data structure for ontologies

>
> I understand your rationale, but I disagree with it, mainly for design
> reasons.
>
OK. I guess I was a bit too brief with my explanation. We considered using
networkx and decided against it, mainly because it was not very useful to
us, and implementing what was necessary for parsing was not an issue for
Kamil. Networkx is currently not a dependency for Biopython or for
Bio.Phylo, and it is not even listed as "optional software" along with
reportlab and such (http://biopython.org/wiki/Download; I guess it should
be, after Eric's comment). My understanding is that the policy is to think
twice before adding a new dependency, because we would then need to care
about compatibility with different networkx versions. Taking Bio.Phylo as
an example, I do think it is a good policy to keep such libraries optional
if possible. Besides, we did not particularly like the way digraphs are
implemented in networkx, with heavy use of dictionaries, as this might get
slow for large dictionaries.

Now to your specific points:

> 1. Enrichment analysis is only one of many different applications that can
> be performed with GO. Therefore, saying that features are unnecessary
> because a particular use case does not require them should not be a design
> consideration for a module that is intended for general use. Rather, a
> generic package handling ontologies should be just that: generic, and
> disengaged from any kind of application. Therefore, if your package is
> intended for biopython the use-case (enrichment analysis) should be
> decoupled from the parser + data structure.
>
We were obviously tailoring this to our needs, but I have to disagree with
your argument. For the reasons above, I think that we should use an
external digraph library _only_ if it is _necessary_ for parsing and
storing, and it clearly isn't.

As for the separation of parsing and enrichment: we do want to keep the
parser separate from the enrichment analysis, and I thought this was quite
clear from the use of separate classes, but we are absolutely open to
discussing how to organize these modules.

If you take the parser as a separate module, networkx is needed even less
(there is really no need for a big graph manipulation library here).

> 2. The graph features that you wrote in Digraph exist in networkx anyway,
> or am I missing something? So why not take advantage of nx instead of
> redoing it even if it does have many redundant (for you) graph manipulation
> & diagnostic features? Someone else may want to use these features,
> including the graphics nx provides, etc.
>

Yes, the point is that parsing and storing a digraph is a simple thing, and
there is no need to add a large library for that. If there were a digraph
library in Biopython, it would be stupid not to use it, but I don't feel we
need to add a dependency here.
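
To illustrate how little is needed, here is a minimal sketch (not our
actual code) of reading [Term] stanzas into plain dictionaries. The name
parse_obo, the EX: identifiers and the sample text are all made up for
illustration, and only the id, name, is_a and relationship tags are handled:

```python
from collections import defaultdict


def parse_obo(lines):
    """Parse [Term] stanzas into (terms, edges).

    terms maps a term id to its name; edges maps a child id to a list of
    (relation, parent id) pairs. Other stanza types are skipped.
    """
    terms = {}
    edges = defaultdict(list)
    term_id = None
    in_term = False
    for raw in lines:
        # Drop a trailing "! comment" (escaped "!" is not handled here).
        line = raw.split("!", 1)[0].strip()
        if line.startswith("["):
            in_term = line == "[Term]"
            term_id = None
            continue
        if not in_term or ":" not in line:
            continue
        tag, value = (part.strip() for part in line.split(":", 1))
        if tag == "id":
            term_id = value
            terms.setdefault(term_id, "")
        elif term_id is None:
            continue
        elif tag == "name":
            terms[term_id] = value
        elif tag == "is_a":
            edges[term_id].append(("is_a", value))
        elif tag == "relationship":
            parts = value.split()
            if len(parts) >= 2:
                edges[term_id].append((parts[0], parts[1]))
    return terms, dict(edges)


# Hypothetical sample input, not taken from a real ontology file.
SAMPLE = """\
format-version: 1.2

[Term]
id: EX:0000001
name: example child term
is_a: EX:0000002 ! example parent term
relationship: part_of EX:0000003 ! example whole

[Typedef]
id: part_of
name: part of
"""

terms, edges = parse_obo(SAMPLE.splitlines())
```

Everything ends up in two dictionaries, so nothing beyond the standard
library is required for the parsing step itself.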

>
>> However it would be very easy to make functions for converting our
>> ontologies to networkx digraphs, either with or without gene annotations as
>> additional attributes.
>>
>>
> Well, the idea is actually to maintain ontologies as nx digraphs. Yes, I
> agree there.
>
>
That's also exactly what is done in Bio.Phylo. We are planning to write
a function analogous to Bio.Phylo._utils.to_networkx() which would take the
simple digraph obtained from parsing an OBO file and give you a networkx
digraph with all the data for manipulation.
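
Such a conversion could look roughly like the sketch below. This is only an
assumption about its shape, not the planned implementation: the function
name, the dictionary layout (term id -> name, child id -> list of
(relation, parent id) pairs) and the EX: identifiers are hypothetical.
networkx is imported inside the function so that it stays optional:

```python
def to_networkx(terms, edges):
    """Build a networkx.DiGraph with child -> parent edges from plain dicts."""
    try:
        import networkx  # lazy import keeps networkx an optional dependency
    except ImportError as err:
        raise ImportError(
            "networkx is required to convert an ontology to a graph"
        ) from err
    graph = networkx.DiGraph()
    for term_id, name in terms.items():
        graph.add_node(term_id, name=name)
    for child, parents in edges.items():
        for relation, parent in parents:
            # A DiGraph holds one edge per (child, parent) pair, so a second
            # relation between the same two terms would overwrite the first.
            graph.add_edge(child, parent, relation=relation)
    return graph


# Hypothetical example data in the assumed layout.
EXAMPLE_TERMS = {"EX:0000001": "example child term"}
EXAMPLE_EDGES = {"EX:0000001": [("is_a", "EX:0000002")]}
```

Calling to_networkx(EXAMPLE_TERMS, EXAMPLE_EDGES) only touches networkx at
that point, so users who never convert do not need it installed.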

>> As for support for different types of transitivity in relations of
>> different types (as in your inference of ancestry for is_a and part_of
>> relations), we are currently not supporting it, but after thinking about it,
>> we will make a change to support this feature. Probably we will let the
>> user (optionally) define the transitivity between relationship types
>> (i.e. is_a + part_of becomes part_of, etc).
>>
>> In general, it would be very helpful if you could give us some rough idea
>> about your expected use cases. For example: are you expecting to modify the
>> graphs in the networkx objects? What will you use the inferred ancestor
>> lists for? So that the changes we make will be as useful to the community
>> as possible.
>>
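
To make the user-defined transitivity idea concrete, here is a rough sketch
under my own assumptions (nothing of this exists yet): the composition table,
the function name inferred_ancestors and the one-letter term names are
hypothetical. A path is followed only while the pair (relation so far, next
relation) has an entry in the table:

```python
# Hypothetical user-supplied table: (relation so far, next relation) -> result.
COMPOSITION = {
    ("is_a", "is_a"): "is_a",
    ("is_a", "part_of"): "part_of",
    ("part_of", "is_a"): "part_of",
    ("part_of", "part_of"): "part_of",
}


def inferred_ancestors(term, edges, composition=COMPOSITION):
    """Return the set of (relation, ancestor) pairs inferred for term.

    edges maps a child to a list of (relation, parent) pairs.
    """
    found = set()
    stack = list(edges.get(term, []))
    while stack:
        relation, node = stack.pop()
        if (relation, node) in found:
            continue  # already reached with this relation; also guards cycles
        found.add((relation, node))
        for next_relation, parent in edges.get(node, []):
            combined = composition.get((relation, next_relation))
            if combined is not None:
                stack.append((combined, parent))
    return found


# Hypothetical toy ontology: A is_a B, B part_of C, C is_a D, C regulates E.
EDGES = {
    "A": [("is_a", "B")],
    "B": [("part_of", "C")],
    "C": [("is_a", "D"), ("regulates", "E")],
}

ancestors_of_a = inferred_ancestors("A", EDGES)
```

Here E is not inferred as an ancestor of A, because the table defines no
result for part_of followed by regulates.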
>
>
> The idea is that expected use cases should not impact the design of a
> basic parser + data structure. In my lab, we are looking at inferred
> ancestors lists to calculate semantic similarity, but it really doesn't
> matter what we (or anyone) will end up using the GO module for. If you
> provide enrichment analysis *on top* of the parser + data structure (as a
> separate module), and we provide semantic similarity (again as a separate
> module *on top* of the parser + data structure) those are nice bonuses. But
> the parser + data structure should be as general as possible. That is:
> include all the information in the OBO file, placed in a digraph structure
> that can be comprehensively interrogated, visualized and manipulated (which
> is what nx offers).
>
I was unfortunately not very clear here. What I meant was that we were
considering what is necessary for typical uses of ontologies, namely
parsing and accessing the terms. I think that is valid in the sense that
the majority of users treat ontologies as read-only data (not that many
Biopython users are making their own ontologies; otherwise, it would have
been implemented ages ago...).

As for the second argument: I fully agree that there should be some
separation between reading ontologies and annotations and any functionality
"on top" of it. But I do not think it would be reasonable to include
networkx as the main data structure. Currently there is only one library
that Biopython depends on, and that is numpy. I do not see networkx as
equally important. I think that we should go the way paved by Bio.Phylo
and use the simple digraph (which already holds all the information from
the OBO files, afaik) as the parsing output, and convert it to networkx
where necessary.

best
Bartek
-- 
Bartek Wilczynski
==================
Institute of Informatics
University of Warsaw
http://www.mimuw.edu.pl/~bartek