[Biopython-dev] KEGG API Wrapper

Wed Dec 19 04:25:35 UTC 2012

Hi All,

Sorry for the delay in updating this KEGG code. Michiel, I've addressed your
suggestions regarding the querying code and the documentation and have
committed changes that reflect this
(https://github.com/kevinwuhoo/biopython/). There's a namespace collision
created by the KEGG.list function, so I use KEGG.list_ instead. However,
I'm sure there's a more elegant solution than this.

Regarding the parsers, there should be a way to unify all parsers and
writers for KEGG objects, since KEGG lists the fields for all of its objects
here: http://www.kegg.jp/kegg/rest/dbentry.html. Each class should extend a
parent class while specifying its own valid fields. The overall file parsing
should be generalized, with field-specific code for the individual fields, so
that fields like genes are handled correctly and consistently across all
record types.
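
A minimal sketch of that idea, under stated assumptions: the names
KEGGRecord, Compound, Enzyme, valid_fields and _parse_genes are hypothetical
and are not in the repository, the field lists are abbreviated, and the
sample record is invented. The flat-file layout (keyword in the first 12
columns, "///" as terminator) is also an assumption here.

```python
import io


class KEGGRecord(dict):
    """Hypothetical parent class; each subclass names its valid fields."""

    valid_fields = ()

    @staticmethod
    def _parse_genes(lines):
        # Field-specific handler: "HSA: 5236 55276" -> {"HSA": ["5236", "55276"]}
        genes = {}
        for line in lines:
            organism, _, ids = line.partition(":")
            genes[organism.strip()] = ids.split()
        return genes

    @classmethod
    def parse(cls, handle):
        # Generic pass: group data lines under the keyword they belong to.
        # Assumed layout: keyword in columns 0-11, data from column 12 on,
        # continuation lines have a blank keyword column, "///" ends a record.
        raw = {}
        keyword = None
        for line in handle:
            if line.startswith("///"):
                break
            if line[:12].strip():
                keyword = line[:12].strip()
            if keyword is not None and line[12:].strip():
                raw.setdefault(keyword, []).append(line[12:].strip())
        # Field-specific pass: use a handler when one exists for the keyword.
        record = cls()
        for keyword, lines in raw.items():
            if keyword not in cls.valid_fields:
                continue  # fields not declared by the subclass are skipped
            handler = getattr(cls, "_parse_" + keyword.lower(), None)
            record[keyword] = handler(lines) if handler else lines
        return record


class Compound(KEGGRecord):
    valid_fields = ("ENTRY", "NAME", "FORMULA")


class Enzyme(KEGGRecord):
    valid_fields = ("ENTRY", "NAME", "CLASS", "GENES")


# Invented sample record, built with ljust so the 12-column layout is explicit.
sample = "".join([
    "ENTRY".ljust(12) + "EXAMPLE\n",
    "NAME".ljust(12) + "example enzyme\n",
    "GENES".ljust(12) + "HSA: 5236 55276\n",
    "".ljust(12) + "PTR: 456908 461162\n",
    "///\n",
])
enzyme = Enzyme.parse(io.StringIO(sample))
```

Only the genes handler is field-specific in this sketch; every other field
falls back to a plain list of lines, so new handlers can be added one at a
time without touching the generic loop.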

After solidifying discussion on these, I'll move the tests over to unittest
too.

Thanks!
Kevin

On Thu, Oct 25, 2012 at 7:52 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:

> Hi Kevin,
>
> Thanks for the documentation! That makes everything a lot clearer.
> Overall I like the querying code and I think we should add it to Biopython.
>
> I have a bunch of comments on the KEGG module, some on the existing code
> and some on the new querying code, see below. Most of these are trivial;
> some may need some further discussion. Perhaps you could let us know which
> of these comments you can address, and which ones you want to skip for now?
>
> Once we have converged with regard to the querying code and the documentation,
> I think we can import your version of the KEGG module into the main
> Biopython repository and add your chapter on KEGG to the main
> documentation, and continue from there on the parsers and the unit tests.
>
> Many thanks!
> -Michiel.
>
>
> About the querying code:
> ----------------------------------
>
> I would replace KEGG.query("list", KEGG.query("find", KEGG.query("conv",
> KEGG.query("link", KEGG.query("info", KEGG.query("get" by the functions
> KEGG.list, KEGG.find, KEGG.conv, KEGG.link, KEGG.info, and KEGG.get.
>
> For list, find, conv, link, and info, instead of going through
> KEGG.generic_parser, I would return the result directly as a Python list.
> In contrast, KEGG.get should return the handle to the results, not the
> data itself. So in the _q function, instead of
>   ...
>   resp = urllib2.urlopen(req)
>   data = resp.read()
>   return query_url, data
> have
>   ...
>   resp = urllib2.urlopen(req)
>   return resp
> Then the user can decide whether to parse the data on the fly with
> Bio.KEGG, or read the data line by line and pick up what they are
> interested in, or to get all data from the handle and save it in a file.
> Note that resp will have a .url attribute that contains the url, so you
> won't need the ret_url keyword.
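
A minimal sketch of the handle-returning variant, written so that it runs
offline. The function name kegg_get, the opener parameter, the FakeResponse
class and the URL prefix are assumptions made for illustration; in real use
the opener would be urlopen, whose response also carries a .url attribute.

```python
import io


def kegg_get(entry, opener):
    """Return an open handle for a 'get' query instead of the data itself.

    `opener` stands in for urlopen so this sketch needs no network access.
    """
    url = "http://rest.kegg.jp/get/" + entry  # assumed URL layout
    return opener(url)


class FakeResponse(io.StringIO):
    """Offline stand-in for the response object; it carries .url as well."""

    def __init__(self, url, text):
        super().__init__(text)
        self.url = url


def fake_opener(url):
    # Invented payload, only there to have something to read.
    return FakeResponse(url, "ENTRY       C00001\nNAME        H2O\n///\n")


handle = kegg_get("cpd:C00001", fake_opener)
first_line = handle.readline()  # the caller may read line by line ...
rest = handle.read()            # ... or take everything that is left
```

Because the caller receives the handle, the same function serves all three
uses mentioned above: parsing on the fly, scanning line by line, or dumping
everything to a file.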
>
> About the parsers:
> ------------------------
>
> I think that we should drop generic_parser. For list, find, conv, link,
> and info, parsing is trivial and can be done by the respective functions
> directly. For get, we already have an appropriate parser for some databases
> (compound, map, and enzyme), but it's easy to add parsers for the other
> databases.
>
> For all parsers in Biopython, there is the question whether the record
> should store information in attributes (as is currently done in Bio.KEGG),
> or alternatively if the record should inherit from a dictionary and store
> information in keys in the dictionary. Personally I have a preference for a
> dictionary, since that allows us to use the exact same keys in the
> dictionary as are used in the file (e.g., we can use "CLASS" as a key, while
> we cannot use .class as an attribute since it is a reserved word, so we use
> .classname instead). But other Biopython developers may not agree with me,
> and to some extent it depends on personal preference.
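
The two styles can be compared side by side in a small sketch. AttributeRecord
and DictRecord are hypothetical names used only for this comparison, and the
class value is sample data; neither class is the existing Bio.KEGG record.

```python
import keyword


class AttributeRecord:
    """Attribute style: field names must be valid Python identifiers."""

    def __init__(self):
        self.entry = ""
        self.classname = []  # ".class" is not possible, so the name is changed


class DictRecord(dict):
    """Dictionary style: keys can be the exact keywords used in the file."""


attr_record = AttributeRecord()
attr_record.classname = ["Oxidoreductases"]

dict_record = DictRecord()
dict_record["CLASS"] = ["Oxidoreductases"]

# "class" is a reserved word, which is why the attribute had to be renamed.
class_is_reserved = keyword.iskeyword("class")
```

With the dictionary style, a parser can store any keyword it meets without a
per-field renaming table; the cost is losing attribute access.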
>
> The parsers miss some key words. The ones I noticed are ALL_REAC,
> REFERENCE, and ORTHOLOGY. Probably we'll find more once we extend the unit
> tests.
>
> Remove the ';' at the end of each term in record.classname.
>
> Convert record.genes to a dictionary for each organism. So instead of
> [('HSA', ['5236', '55276']), ('PTR', ['456908', '461162']), ('PON',
> ['100190836', '100438793']), ('MCC', ['100424648', '699401']...
> have
> {'HSA': ['5236', '55276'], 'PTR': ['456908', '461162'], 'PON':
> ['100190836', '100438793'], 'MCC': ['100424648', '699401'], ...
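
A minimal sketch of that conversion, using the first two organisms from the
example above. The helper name genes_to_dict is hypothetical, and merging
repeated organism codes (rather than overwriting) is an assumption about the
desired behaviour.

```python
def genes_to_dict(pairs):
    """Turn [(organism, [gene ids]), ...] into {organism: [gene ids]}."""
    genes = {}
    for organism, ids in pairs:
        # extend rather than assign, so a repeated organism code is merged
        genes.setdefault(organism, []).extend(ids)
    return genes


genes_as_list = [("HSA", ["5236", "55276"]), ("PTR", ["456908", "461162"])]
genes = genes_to_dict(genes_as_list)
human_genes = genes["HSA"]  # direct lookup instead of scanning the list
```

If organism codes never repeat within a record, dict(genes_as_list) gives the
same result in one call.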
>
> Also for record.dblinks, record.disease, record.structures, use a
> dictionary.
>
> In record.pathway, all entries start with 'PATH'. Perhaps we should check
> with KEGG whether there could be anything other than 'PATH' there; otherwise I
> don't see the reason why it's there. Assuming that there could be something
> different there, I would also use a dictionary with 'PATH' as the key.
>
> In record.reaction, some chemical names can be very long and extend over
> multiple lines. In such cases, the continuation line starts with a '$'. The
> parser should remove the '$' and join the two lines.
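
A minimal sketch of the joining step. It assumes the lines have already been
stripped of the keyword column, and that the two halves are joined without an
added space because the break falls inside a chemical name. The function name
and the sample lines are invented.

```python
def join_continuations(lines):
    """Merge lines starting with '$' into the line that precedes them."""
    joined = []
    for line in lines:
        if line.startswith("$") and joined:
            joined[-1] += line[1:]  # drop the '$' marker and append the rest
        else:
            joined.append(line)
    return joined


reaction_lines = [
    "first-compound + very-long-chemical-",
    "$name = product",
    "another reaction",
]
reactions = join_continuations(reaction_lines)
```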
>
> About the tests:
> --------------------
>
> We should update the data files in Tests/KEGG. This will fix some "bugs"
> in these data files.
>
> We should switch test_KEGG.py to the unit test framework.
>
> We should do some more extensive testing to make sure we are not missing
> some key words.
>
> About the documentation:
> ---------------------------------
> It's great that we now have some documentation.
>
> On page 233, I would suggest replacing "id_" with "accession" or
> something else, since the underscore in "id_" may look funky to new users.
>
> Also it may be better not to reuse variable names (e.g. "pathway" is used
> in three different ways in the example). It's OK of course in general, but
> for this example it may be more clear to distinguish the different usages
> of this variable from each other.
>
> For repair_genes, you can use a set instead of a list throughout.
>
>
>
>
> --- On Wed, 10/24/12, Kevin Wu <kjwu at ucsd.edu> wrote:
>
>
> From: Kevin Wu <kjwu at ucsd.edu>
> Subject: Re: [Biopython-dev] KEGG API Wrapper
> To: "Peter Cock" <p.j.a.cock at googlemail.com>, "Zachary Charlop-Powers" <
> zcharlop at mail.rockefeller.edu>, "Michiel de Hoon" <mjldehoon at yahoo.com>
> Cc: Biopython-dev at lists.open-bio.org
> Date: Wednesday, October 24, 2012, 6:38 PM
>
>
> Hi All,
>
> Thanks for the comments. I've written a bit of documentation on the entire
> KEGG module and have attached the relevant pages to this email. There
> didn't seem to be an appropriate place for examples, so I just added a new
> chapter. I've also committed the updated file to GitHub.
>
> I left out the parsers because the current parsers only cover a small
> portion of the possible responses from the API. Also, I'm not confident
> that some of the parsers correctly retrieve all the fields. However, I've
> written a very general parser that does a rough job of retrieving fields
> when a database format is returned, since I find myself reusing that code
> for all database formats. It's possible to modify it to correctly account
> for the different fields, but it would probably take a bit of work to
> figure out each field manually. Otherwise, it also parses the TSV/flat
> file returned.
>
> Also, @zach, thanks for checking it out and testing it!
>
> Thanks All!
> Kevin
>
> On Wed, Oct 17, 2012 at 4:09 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> On Wed, Oct 17, 2012 at 12:55 AM, Zachary Charlop-Powers
> <zcharlop at mail.rockefeller.edu> wrote:
> > Kevin,
> > Michiel,
> >
> > I just tested Kevin's code for a few simple queries and it worked great. I
> > have always liked KEGG's organization of data and really appreciate this
> > RESTful interface to their data; in some ways I think it is easier to use
> > the web interfaces for KEGG than it is for NCBI. Plus the KEGG coverage of
> > metabolic networks is awesome. I found the examples in Kevin's test script
> > to be fairly self-explanatory, but a simple, spelled-out example in the
> > Tutorial would be nice.
> >
> > One thought, though, is that you can retrieve MANY different types of data
> > from the KEGG REST API, which means that the user will probably have to
> > parse the data themselves. Data retrieved with "list" can be lists of
> > genes or compounds or organisms, and after a cursory look these are each
> > formatted differently. The same is true of the 'find' command. So I think
> > you were right to leave out parsers, because I think they will be a moving
> > target highly dependent on the query.
> >
> > Thank You Kevin,
> > zach cp
>
> Good point about decoupling the web API wrapper and the parsers.
> The Bio.Entrez and Bio.TogoWS modules handle this by returning
> handles for web results, which you can then parse with an appropriate
> parser (e.g. SeqIO for GenBank files, the Medline parser, etc.).
>
> Note that this is a little more fiddly under Python 3 due to the text
> mode distinction between unicode and binary... just something to
> keep in the back of your mind.
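
A minimal sketch of the Python 3 side of this, under stated assumptions: a
BytesIO object stands in for the binary handle a web request returns, the
payload is invented, and UTF-8 is assumed as the encoding. Wrapping the
binary handle gives a text handle that a line-based parser can iterate over.

```python
import io

# Stand-in for a binary web response; under Python 3 reads return bytes.
binary_handle = io.BytesIO(b"ENTRY       C00001\nNAME        H2O\n///\n")

# Wrap it so that iteration yields str lines; the encoding is an assumption.
text_handle = io.TextIOWrapper(binary_handle, encoding="utf-8")

lines = [line.rstrip("\n") for line in text_handle]
```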
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>
>
>