[Biopython-dev] OBO parser & DAG

Bartek Wilczynski bartek at rezolwenta.eu.org
Thu Jan 9 23:06:30 UTC 2014


Hi,

I think that's a great solution. In fact, Kamil has now implemented such a
conversion. Take a look and see if it suits your needs:
https://github.com/tosterovic/biopython/commit/904e51303391411b42697205b09181378662807d

We would be very happy to contribute this module to the Biopython repo, so
it would be great if more people could take a look and suggest the changes
needed for it to be accepted as part of Biopython.
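
For anyone who wants the gist without reading the commit, here is a minimal
sketch of the idea. It is not the code from the commit above: SimpleDiGraph,
its attributes, to_networkx() and the EX: term ids are hypothetical names
made up for illustration. The only assumption about networkx is its
standard DiGraph API (add_node and add_edge with keyword attributes), and
it is imported only inside the conversion function, so it stays optional.

```python
class SimpleDiGraph:
    """Hypothetical stand-in for the dependency-free digraph the parser builds."""

    def __init__(self):
        self.nodes = {}  # term id -> dict of term attributes
        self.edges = []  # list of (child id, parent id, relation type)

    def add_node(self, term_id, **attrs):
        self.nodes.setdefault(term_id, {}).update(attrs)

    def add_edge(self, child, parent, relation):
        # Make sure both endpoints exist before recording the edge.
        self.add_node(child)
        self.add_node(parent)
        self.edges.append((child, parent, relation))


def to_networkx(dag):
    """Copy a SimpleDiGraph into a networkx.DiGraph (illustrative only)."""
    try:
        import networkx  # optional dependency, needed only for conversion
    except ImportError:
        raise ImportError("Install networkx to use to_networkx().")
    graph = networkx.DiGraph()
    for term_id, attrs in dag.nodes.items():
        graph.add_node(term_id, **attrs)
    for child, parent, relation in dag.edges:
        # A plain DiGraph keeps one edge per (child, parent) pair, so a
        # second relation between the same two terms would overwrite this.
        graph.add_edge(child, parent, relation=relation)
    return graph


# Two made-up terms joined by an is_a relation.
dag = SimpleDiGraph()
dag.add_node("EX:0000001", name="root term")
dag.add_node("EX:0000002", name="child term")
dag.add_edge("EX:0000002", "EX:0000001", "is_a")

# With networkx installed, to_networkx(dag) returns the equivalent DiGraph.
```

Parsing and storing need nothing beyond the standard library this way, and
only users who call the conversion need networkx.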

best
Bartek


On Wed, Jan 8, 2014 at 10:59 PM, Iddo Friedberg <idoerg at gmail.com> wrote:

> So it seems like we are debating minimal external dependencies vs.
> maximizing functionality.
>
> How about the following: the OBO file will be read into an independent,
> basic digraph like Bartek's team has already constructed.
>
> But we will also have the ability to transfer the biopython DAG into a
> networkx DAG, so that anyone wishing to play elaborate games with the
> ontology structure (as we do), can do so without re-inventing the wheel.
>
> How does that sound?
>
> One thing about networkx: I still really, really like it :), and we
> started writing the digraph based on it because of this page:
>
> http://biopython.org/wiki/Gene_Ontology#GO_Directed_Acyclic_Graph
>
> But the fact that this spec using networkx has been written does not have
> to commit us to this particular design.
>
>
>
> On Wed, Jan 8, 2014 at 4:05 PM, Bartek Wilczynski <
> bartek at rezolwenta.eu.org> wrote:
>
>> Hi,
>>
>> I'll answer below (even though I do have a bad habit of top-posting my
>> answers, sorry).
>>
>>
>> On Wed, Jan 8, 2014 at 5:55 PM, Iddo Friedberg <idoerg at gmail.com> wrote:
>>
>> > I wrote about not using networkx as the main data structure for
>> > ontologies
>>
>>>
>>> I understand your rationale, but I disagree with it, mainly for design
>>> reasons.
>>>
>> Ok. I guess I was a bit too brief with my explanation. We considered
>> using networkx and decided against it, mainly because it was not very
>> useful to us, and implementing what was necessary for parsing was not
>> an issue for Kamil. Networkx is currently not a dependency of Biopython
>> or of Bio.Phylo, and it is not even listed as "optional software" along
>> with reportlab and such (http://biopython.org/wiki/Download ; I guess it
>> should be, after Eric's comment). My understanding is that the policy is
>> to think twice before adding a new dependency, because we would then
>> need to maintain compatibility with different networkx versions.
>> Taking Bio.Phylo as an example, I do think it is a good policy to keep
>> such libraries optional where possible. Besides, we did not particularly
>> like the way digraphs are implemented in networkx, with heavy use of
>> dictionaries, as this might get slow for large dictionaries.
>> Now to your specific points:
>>
>>
>>> 1. Enrichment analysis is only one of many different applications that
>>> can be performed with GO. Therefore, saying that features are unnecessary
>>> because a particular use case does not require them should not be a design
>>> consideration for a module that is intended for general use. Rather, a
>>> generic package handling ontologies should be just that: generic, and
>>> disengaged from any kind of application. Therefore, if your package is
>>> intended for biopython the use-case (enrichment analysis) should be
>>> decoupled from the parser + data structure.
>>>
>> We were obviously tailoring this to our needs, but I have to disagree
>> with your argument. For the reasons above, I think that we should use
>> an external digraph library _only_ if it is _necessary_ for parsing and
>> storing, and it clearly isn't.
>>
>> As for the separation of parsing and enrichment, we do want to keep the
>> parser separate from the enrichment analysis. I thought this was quite
>> clear from the use of separate classes, but we are absolutely open to
>> discussing how to organize these modules.
>>
>> If you take the parser as a separate module, networkx is needed even
>> less (there is really no need for a big graph manipulation lib here).
>>
>>
>>> 2. The graph features that you wrote in Digraph exist in networkx
>>> anyway, or am I missing something? So why not take advantage of nx instead
>>> of redoing it even if it does have many redundant (for you) graph
>>> manipulation & diagnostic features? Someone else may want to use these
>>> features, including the graphics nx provides, etc.
>>>
>>
>> Yes, the point is that parsing and storing a digraph is a simple thing
>> and there is no need to add a large library for that. If there were a
>> digraph library in Biopython, it would be stupid not to use it, but I
>> don't feel we need to add a dependency here.
>>
>>
>>>
>>>> However it would be very easy to make functions for converting our
>>>> ontologies to networkx digraphs, either with or without gene annotations as
>>>> additional attributes.
>>>>
>>>>
>>> Well, the idea is actually to maintain ontologies as nx digraphs. Yes, I
>>> agree there.
>>>
>>>
>> That's also exactly what is done in Bio.Phylo. We are planning to
>> write a function analogous to Bio.Phylo._utils.to_networkx() which would
>> take a simple digraph obtained from parsing an OBO file and give you a
>> networkx digraph with all the data for manipulation.
>>
>>
>>
>>>> As for support for different types of transitivity in relations of
>>>> different type (as in your inference of ancestry for is_a and part_of
>>>> relations), we do not currently support it, but after thinking about
>>>> it, we will make a change to support this feature. Probably we will let
>>>> the user (optionally) define the transitivity between relationship
>>>> types (i.e. is_a + part_of becomes part_of, etc).
>>>>
>>>> In general, it would be very helpful if you could give us some rough
>>>> idea about your expected use cases. For example: are you expecting to
>>>> modify the graphs in the networkx objects? What will you use the inferred
>>>> ancestor lists for? So that the changes we make will be as useful to the
>>>> community as possible.
>>>>
>>>
>>>
>>> The idea is that expected use cases should not impact the design of a
>>> basic parser + data structure. In my lab, we are looking at inferred
>>> ancestors lists to calculate semantic similarity, but it really doesn't
>>> matter what we (or anyone) will end up using the GO module for. If you
>>> provide enrichment analysis *on top* of the parser + data structure (as a
>>> separate module), and we provide semantic similarity (again as a separate
>>> module *on top* of the parser + data structure) those are nice bonuses. But
>>> the parser + data structure should be as general as possible. That is:
>>> include all the information in the OBO file, placed in a digraph structure
>>> that can be comprehensively interrogated, visualized and manipulated (which
>>> is what nx offers).
>>>
>> I was unfortunately not very clear here. What I meant was that we
>> considered what is necessary for typical uses of ontologies, namely
>> parsing and accessing the terms. And I think that is valid in the sense
>> that the majority of users treat ontologies as read-only data (not that
>> many Biopython users are making their own ontologies; otherwise, it
>> would have been implemented ages ago...).
>>
>> As for the second argument: I do fully agree that there should be some
>> separation between ontology and annotation reading and any functionality
>> "on top" of it. But I think it would not be reasonable to include
>> networkx as the main data structure. Currently there is only one
>> library that Biopython depends on, and it is numpy. I do not see networkx
>> as equally important. I think that we should go the way paved by
>> Bio.Phylo and use the simple digraph (which already holds all the
>> information from the OBO files, afaik) for parsing output, and convert it
>> to networkx where necessary.
>>
>> best
>> Bartek
>> --
>> Bartek Wilczynski
>> ==================
>> Institute of Informatics
>> University of Warsaw
>> http://www.mimuw.edu.pl/~bartek
>>
>
>
>
> --
> Iddo Friedberg
> http://iddo-friedberg.net/contact.html
> ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
> .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
> >>----.<--.>++++++.<<<<------------------------------------.
>



-- 
Bartek Wilczynski
==================
Institute of Informatics
University of Warsaw
http://www.mimuw.edu.pl/~bartek


