[Bioperl-l] chado code

Chris Mungall cjm@fruitfly.org
Tue, 26 Nov 2002 18:24:19 -0800 (PST)


I have some code almost ready to check in. I'm not quite sure where it all
fits in yet so I thought I'd run it by you all.

This all revolves around the nascent chado schema; info is available from
www.gmod.org

quick summary - chado is made up of multiple modules covering different
biological domains. the sequence module treats features as nodes in a
feature relationship graph

this is the code and the repositories i'm thinking of -

bioperl:
=======

Bio::SeqIO::chadoxml
Bio::SearchIO::chadoxml

So far these only go from bioperl objects --> chado, as that's what I'm
mostly interested in at the moment. It works by turning the bioperl
objects into a tree / hierarchical structured tag representation of the
chado schema. This tree can be represented as XML (or S-expressions, which
I prefer). This lets chado leverage a whole bunch of incredibly useful
bioperl code.
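
A minimal sketch of the tree idea, written in Python rather than Perl purely
for illustration. It does not use Data::Stag or the real chado-xml
vocabulary; the element names and the example feature are made up. It shows
one nested structure rendered as either XML or S-expressions.

```python
from xml.sax.saxutils import escape

# A node is (tag, value): value is either a string (leaf) or a list of child nodes.
# The tags below are illustrative, not the official chado-xml element names.
feature = ("feature", [
    ("uniquename", "example-transcript-1"),
    ("type", "mRNA"),
    ("featureloc", [
        ("fmin", "100"),
        ("fmax", "900"),
        ("strand", "1"),
    ]),
])

def to_xml(node, indent=0):
    """Render a node as indented XML."""
    tag, value = node
    pad = "  " * indent
    if isinstance(value, list):
        inner = "\n".join(to_xml(child, indent + 1) for child in value)
        return f"{pad}<{tag}>\n{inner}\n{pad}</{tag}>"
    return f"{pad}<{tag}>{escape(value)}</{tag}>"

def to_sexpr(node, indent=0):
    """Render the same node as an S-expression."""
    tag, value = node
    pad = "  " * indent
    if isinstance(value, list):
        inner = "\n".join(to_sexpr(child, indent + 1) for child in value)
        return f"{pad}({tag}\n{inner})"
    quoted = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{pad}({tag} "{quoted}")'

print(to_xml(feature))
print(to_sexpr(feature))
```

Both renderers walk the same structure, so the choice of syntax is a
presentation detail rather than a change in the data.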

It drops a lot of data at the moment; I plan to add the rest as I need it.

note that the chado schema and the corresponding chado xml are not yet
stable, but i'm charging on with this anyway, perhaps to the consternation
of the other chado developers - sorry guys.

bioperl-db:
==========

Bio::DB::ChadoSQL::*

Not quite there yet. I'm just working on chado-xml -> chado db at the
moment. This is actually super simple as the chado-xml tree representation
maps almost directly to the schema (there are a few denormalisations in my
version of the chado-xml to make it nicer). It's just a matter of
recursively descending the tree and updating/inserting based on the unique
constraints.
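
The recursive descent described above might look roughly like this. This is
a sketch in Python with sqlite3, not the Bio::DB::ChadoSQL code: the
two-table schema is a toy, and the UNIQUE_KEYS and PARENT_FK lookup tables
are assumptions made for the example (the real chado tables and their
unique constraints have more columns).

```python
import sqlite3

# Toy two-table schema; real chado tables have many more columns.
SCHEMA = """
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL UNIQUE,
    name TEXT
);
CREATE TABLE featureprop (
    featureprop_id INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    type TEXT NOT NULL,
    value TEXT,
    UNIQUE (feature_id, type)
);
"""

# Assumed metadata: the unique-constraint columns of each table, and the
# foreign key column that links a child row to its parent row.
UNIQUE_KEYS = {"feature": ["uniquename"], "featureprop": ["feature_id", "type"]}
PARENT_FK = {"featureprop": "feature_id"}

def store(db, node, parent_id=None):
    """Insert or update one node, then recurse into its children."""
    table, cols, children = node
    cols = dict(cols)
    if parent_id is not None:
        cols[PARENT_FK[table]] = parent_id
    keys = UNIQUE_KEYS[table]
    where = " AND ".join(f"{k} = ?" for k in keys)
    row = db.execute(f"SELECT {table}_id FROM {table} WHERE {where}",
                     [cols[k] for k in keys]).fetchone()
    if row:
        # A row with these unique-key values exists: update the other columns.
        row_id = row[0]
        rest = [c for c in cols if c not in keys]
        if rest:
            sets = ", ".join(f"{c} = ?" for c in rest)
            db.execute(f"UPDATE {table} SET {sets} WHERE {table}_id = ?",
                       [cols[c] for c in rest] + [row_id])
    else:
        names = ", ".join(cols)
        marks = ", ".join("?" for _ in cols)
        row_id = db.execute(f"INSERT INTO {table} ({names}) VALUES ({marks})",
                            list(cols.values())).lastrowid
    for child in children:
        store(db, child, row_id)
    return row_id

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
tree = ("feature", {"uniquename": "gene1", "name": "old name"},
        [("featureprop", {"type": "comment", "value": "first load"}, [])])
store(db, tree)
# Storing a changed tree with the same unique keys updates rows in place.
tree2 = ("feature", {"uniquename": "gene1", "name": "new name"},
         [("featureprop", {"type": "comment", "value": "second load"}, [])])
store(db, tree2)
```

Because lookups go through the unique constraints, loading the same tree
twice is idempotent, which is what makes a generic loader practical.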

I'll also have adaptors (eg SeqAdaptor, SeqFeatureAdaptor), but these will
basically be simple wrappers that use the IO classes to make a tree
object, then generically store the tree. Personally I quite like this
separation between schema and objects.

go:
==

go2chadoxml

this is already checked in to the go-dev repository - it takes GO flat
files and turns them into a tree representation that maps to the cv module
in the chado schema - it also computes the closure over the term graph. the
generic chado loader (above) can then take these trees and store them.
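
For readers unfamiliar with the closure step, here is a small Python sketch
of computing all ancestors of each term from direct parent links. It is not
the go2chadoxml code; the term names are placeholders, and it assumes the
graph is acyclic and that a term is not counted as its own ancestor.

```python
# Direct parent links: child -> set of parents. Term names are placeholders.
PARENTS = {
    "leaf": {"mid1", "mid2"},
    "mid1": {"root"},
    "mid2": {"root"},
    "root": set(),
}

def closure(parents):
    """Return term -> set of all ancestors (transitive closure of parent links).

    Assumes the graph is a DAG; a cycle would recurse forever.
    """
    result = {}

    def ancestors(term):
        if term not in result:
            found = set()
            for parent in parents.get(term, ()):
                found.add(parent)
                found |= ancestors(parent)
            result[term] = found
        return result[term]

    for term in parents:
        ancestors(term)
    return result

paths = closure(PARENTS)
for term in sorted(paths):
    print(term, sorted(paths[term]))
```

Precomputing these pairs lets a query such as "everything under this term"
be answered without walking the graph each time.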

gmod: (part of the chado repository or independent?)
====

Bio::Chado::Transform::*

not sure about the namespace yet. this is a collection of transforms
operating over chado tree representations. XSLT-like, but without the XSLT
- the transforms are specified in perl, operating over trees rather than
objects.

the transforms include:

* location inference: eg take a feature graph and fill in begin/end coords
for non-leaf features (eg transcripts) based on coords of leaf features
(eg exons)

* coordinate transformation: move a feature from one assembly level to
another, or represent redundant locations on multiple assembly levels

* feature inference: generate redundant implicit features (eg introns,
splice sites, UTR) from explicit features

* sequence inference: calculate residues based on feature coordinates and
central dogma, taking into account various biological weirdnesses the
sequence ontology and chado are designed to cope with - eg trans-splicing,
transcript editing, stop codon readthrough etc
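
Two of the transforms listed above, location inference and feature
inference, can be sketched as follows. This is illustrative Python, not the
Bio::Chado::Transform code. It assumes features are plain dicts with a
"parts" list, that coordinates are half-open intervals named fmin/fmax, and
it ignores strand entirely.

```python
# A feature is a dict; "parts" holds its child features. Coordinates are
# assumed to be half-open intervals (fmin inclusive, fmax exclusive).

def infer_locations(feature):
    """Fill in fmin/fmax of non-leaf features from the span of their parts."""
    parts = feature.get("parts", [])
    for part in parts:
        infer_locations(part)
    if parts:
        feature["fmin"] = min(p["fmin"] for p in parts)
        feature["fmax"] = max(p["fmax"] for p in parts)
    return feature

def infer_introns(transcript):
    """Add an implicit intron between each pair of consecutive exons."""
    exons = sorted((p for p in transcript["parts"] if p["type"] == "exon"),
                   key=lambda e: e["fmin"])
    introns = [{"type": "intron", "fmin": left["fmax"], "fmax": right["fmin"]}
               for left, right in zip(exons, exons[1:])
               if left["fmax"] < right["fmin"]]
    transcript["parts"] = sorted(transcript["parts"] + introns,
                                 key=lambda p: p["fmin"])
    return transcript

gene = {"type": "gene", "parts": [
    {"type": "mRNA", "parts": [
        {"type": "exon", "fmin": 100, "fmax": 200},
        {"type": "exon", "fmin": 300, "fmax": 450},
        {"type": "exon", "fmin": 600, "fmax": 700},
    ]},
]}
infer_locations(gene)
mrna = infer_introns(gene["parts"][0])
```

After both steps the gene and mRNA span 100 to 700, and the mRNA carries two
implicit introns in the gaps between its three exons.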

another useful transform would be taking a direct mapping of genbank to
chado, then turning this into something more usable (eg correctly
organising the mRNA and CDS features in a feature graph, mapping to the
Sequence Ontology)

lots of other transforms are possible, for variation features, ontologies,
computational analyses, genetic interactions....

I also have some code for viewing chado feature graphs as interconnected
coloured bubbles.

the idea is you can pipe a bunch of these transforms together to get a
tree that is most useful to your application (also useful for building
warehouse versions of the database)
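
Piping transforms together amounts to function composition over trees. A
rough Python sketch, with made-up transform names (drop, rename) standing in
for real ones and trees as (tag, value) pairs:

```python
from functools import reduce

def pipeline(*transforms):
    """Compose tree -> tree transforms, applied left to right."""
    return lambda tree: reduce(lambda t, f: f(t), transforms, tree)

# A node is (tag, value): value is a string (leaf) or a list of child nodes.

def drop(tag):
    """Transform factory: remove every child node carrying the given tag."""
    def transform(node):
        name, value = node
        if isinstance(value, list):
            return (name, [transform(c) for c in value if c[0] != tag])
        return node
    return transform

def rename(old, new):
    """Transform factory: rename every node tagged `old` to `new`."""
    def transform(node):
        name, value = node
        name = new if name == old else name
        if isinstance(value, list):
            value = [transform(c) for c in value]
        return (name, value)
    return transform

# A hypothetical "slim" view, eg for a warehouse table without sequence data.
slim = pipeline(drop("residues"), rename("uniquename", "id"))

tree = ("feature", [
    ("uniquename", "f1"),
    ("residues", "ACGT"),
    ("featureloc", [("fmin", "0"), ("fmax", "4")]),
])
result = slim(tree)
print(result)
```

Each transform returns a new tree and leaves its input alone, so the stages
can be reordered or reused without side effects.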

modules required
////////////////

all this code depends on a module Data::Stag that I'm about to upload to
CPAN. this is a tree / structured tag module that happens to play well
with XML. I also find it to be a nice alternative to using objects and
object models, which I have recently taken an aversion to.

discussion
//////////

rather than spreading all this code over multiple cvs repositories in
bioperl and gmod I'm entertaining the idea of collecting them together as
a single codebase (I guess the SeqIO and SearchIO should stay in bioperl).

There are various motives behind this plan. I'm attempting to get out of
software engineering, and anyway, I don't have a great track record
supporting my software. I see this code as my own handy data toolkit that
I'd like to make available to anyone who would find it useful (as opposed
to a Grand Engineering project). Another reason is that this code embodies
a programming paradigm that is perhaps a little idiosyncratic to some,
particularly the eschewing of object modeling, and the use of a tree data
structure alternative to XML. Perhaps I am insane and it all won't work,
in which case I don't want to take anyone down with me, unless they are
insane too. Besides, I may decide to rewrite the whole thing in lisp
halfway through. So maybe it's all a bit experimental for bioperl/gmod?

Another thing is that my chado-xml/chado-trees may diverge a bit from the
official chado xml-dtd/schema. this remains to be seen. the divergence
would be purely syntactic, not semantic.

A question for Lincoln: if I house something in GMOD do I then have to
commit to a certain level of support, documentation for the non-hacker,
bug fixes, etc?

I *am* fully committed to this for the chado schema, I don't mind doing
this for the chado SeqIO and SearchIO, and GO too - but the rest is just
my crazy stuff.

There is also talk of a chado API and a chado object model and possibly
chado UML. I won't be going anywhere near this, but if someone is going to
take this on and fully support it, provide bioperl interoperability, base
applications around it, etc, then I'd rather steer my code well out of the
way, to let this behemoth through to do its business (I guess this would
most likely be in java?).

Even if my code becomes a distinct toolkit I'd like to keep the Bio::
namespace.

Gosh, this email is actually longer than the code itself....

Anyway, I should have some mostly broken code ready to check in next week.