[Bioperl-l] New Annotation interfaces! Mark/David/whoever - check it out!

Chris Mungall cjm@fruitfly.bdgp.berkeley.edu
Wed, 31 Oct 2001 00:23:38 -0800 (PST)


I really like Ewan's recursive hash tree. It does raise some interesting
questions.
 
For instance: as a general question, in object model design, how deep do
you go with your UML diagrams and class definitions, and at what level of
granularity do you decide you don't want the hassle of hundreds of
classes and would rather use name/value pairs and/or a more advanced
recursive hash_tree system a la AnnotationI?
 
While I think the above hash tree mechanism will suffice for roundtrip
parsing/exports (which is a large proportion of use cases) there may come
a point where you'll want to start to extend it.
 
One extension would be to make it graph based rather than tree based. Much
harder to deal with, but it gives you more flexibility (e.g. the
relationship between authors and references is many to many).
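
To make the many-to-many point concrete, here is a minimal sketch in plain
perl. Everything in it is made up for illustration (the ids, the field
names, the helper subs); it is not part of the committed framework. Nodes
live in flat tables and point at each other by id, so one author can hang
off several references and vice versa, which a strict tree can't express
without copying.

```perl
use strict;
use warnings;

# Hypothetical node tables, keyed by id. Links are stored as ids rather
# than perl references, which avoids circular data structures.
my %author = (
    author_a => { name => 'Author A', references => [] },
    author_b => { name => 'Author B', references => [] },
);
my %reference = (
    ref1 => { title => 'First paper',  authors => [] },
    ref2 => { title => 'Second paper', authors => [] },
);

# Each edge is a (reference id, author id) pair; record it on both nodes.
my @edges = ( [ ref1 => 'author_a' ], [ ref1 => 'author_b' ], [ ref2 => 'author_a' ] );
for my $edge (@edges) {
    my ($ref_id, $author_id) = @$edge;
    push @{ $reference{$ref_id}{authors} },    $author_id;
    push @{ $author{$author_id}{references} }, $ref_id;
}

# The graph can be walked in either direction.
sub references_of { my ($author_id) = @_; return sort @{ $author{$author_id}{references} } }
sub authors_of    { my ($ref_id)    = @_; return sort @{ $reference{$ref_id}{authors} } }

print "author_a wrote: ", join(', ', references_of('author_a')), "\n";
print "ref1 is by: ",     join(', ', authors_of('ref1')),        "\n";
```

The price is the one mentioned above: once links are ids you need lookups
and some care about dangling ids, which is the "much harder to deal with"
part.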
 
Another thing is you may want to introduce some level of type safety; it
may seem tricky to balance type safety and extensibility but there are
some neat solutions. For instance you could include meta data in the
hash_tree - allowable values for certain keys and that kind of thing. By
tying the meta data to the data itself you keep it extensible.
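
As a rough sketch of what I mean by meta data living inside the hash_tree:
the "_meta" key, the "allowed" and "pattern" rule names and the
validate_tree sub below are all my own invention for illustration, not
anything in AnnotationI. The idea is just that the rules travel with the
data, so new keys can bring their own constraints along.

```perl
use strict;
use warnings;

# Hypothetical annotation tree. The '_meta' entry is an assumed convention:
# it maps a key to the rules its value must satisfy.
my $annotation = {
    _meta => {
        strand => { allowed => [ '+', '-', '.' ] },
        rank   => { pattern => qr/^\d+$/ },
    },
    strand => '+',
    rank   => '3',
    note   => 'free text, no rule attached',
};

# Return a list of problems found; an empty list means the tree is valid.
sub validate_tree {
    my ($tree) = @_;
    my $meta = $tree->{_meta} || {};
    my @problems;
    for my $key (sort keys %$meta) {
        next unless exists $tree->{$key};    # rules only apply to keys present
        my $value = $tree->{$key};
        my $rule  = $meta->{$key};
        if ($rule->{allowed} && !grep { $_ eq $value } @{ $rule->{allowed} }) {
            push @problems, "$key: '$value' is not an allowed value";
        }
        if ($rule->{pattern} && $value !~ $rule->{pattern}) {
            push @problems, "$key: '$value' does not match the expected pattern";
        }
    }
    return @problems;
}

my @problems = validate_tree($annotation);
print @problems ? join("\n", @problems) . "\n" : "ok\n";
```

Keys without a rule (like "note" here) pass through untouched, which is
what keeps the thing extensible.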
 
But hold on a minute - for the last god knows how long the KR people have
been in the corner of the playground all by themselves doing this sort of
thing. And it turns out they have some jolly good ideas! Check out the
DAML+OIL spec - it's layered on top of RDF (RDF is to graphs what XML is
to trees). AI people are interested in it, particularly as an
infrastructure for distributed inference/reasoning over the semantic web.
But leaving that aside for the moment, it turns out to be a great way of
creating an extensible class system, with most of the type safety of
perl/java/C++ but with the freedom and extensibility of the AnnotationI
system below. Not only that, but the tools that go with it (such as
Protege-2000 - open source, naturally) are *way* nicer to use than bloody
Rational Rose and all that snooze inducing $$$ UML stuff. In fact even
biologists can use them (you wouldn't let someone who hadn't read Design
Patterns at least 5 times anywhere near your UML/class system, would
you?).

Personally I'd like to see this kind of extensible system introduced in
other areas of bioperl - e.g. rather than hardcoding Gene/Transcript/Exon
relationships inside perl classes - but I think I've lost the vote on that
one (I promise not to go on about polycistronic genes again).

I've actually been toying around with an extensible extension (!) to the
perl object system; you can see the so far very amateurish results at
bioinformatics.org under knowledge-based objects. I wouldn't normally
announce this as it's just some hacked together ideas at the moment, but
it is very relevant to the system below.


On Tue, 30 Oct 2001, Ewan Birney wrote:

> 
> 
> 
> Well - two hours of British Rail has its benefits. I have committed the
> new Annotation framework. This is a pretty major set of changes! The 
> exciting thing is that this is **definitely** the right way to go.
> 
> 
> The most important point here is the framework is
> 
>   (a) extensible 
> 
> and
> 
>   (b) plays well with XML/data orientated approaches.
> 
> 
> I've included in this message the two main interfaces -
> Bio::AnnotationCollectionI and Bio::AnnotationI. I have implemented
> these in Bio::Annotation::Collection and adapted the existing
> Bio::Annotation::* classes to work with them.
> 
> Then I have added in a backward-compatibility harness for
> Bio::Annotation::Collection for the (I guess) 0.7.* API (it calls
> deprecated for each function, so you will know it).
> 
> Then I adapted the genbank/embl/swiss SeqIO systems to work - was easy
> (tick for the design I think) and t/SeqIO.t passed without any additional
> work (wow!)
> 
> 
> Things on my TODO list 
> 
>   - put in Controlled Vocab somehow - need to talk to the GO folks 
> to check I do it right
> 
>   - decide how to deal with updates as I am sure GenQuire will want to
> write back here (right guys??)
> 
>   - check XML outputting
> 
> One exploratory item is
> 
>   - generic XML registration for generic XML stream -> Annotation objects
> 
> 
> 
> This has been a Bioperl production brought to you by British Rail...
> 
> 
> ewan  
> 
> 
> 
> 
> 
> =head1 NAME
> 
> Bio::AnnotationCollectionI - Interface for annotation collections
> 
> =head1 SYNOPSIS
> 
>    # get an AnnotationCollectionI somehow, eg
> 
>    $ac = $seq->annotation();
> 
>    foreach $key ( $ac->get_all_annotation_keys() ) {
>        @values = $ac->get_Annotations($key);
>        foreach $value ( @values ) {
>           # value is a Bio::AnnotationI, and defines an "as_text" method
>           print "Annotation ",$key," stringified value ",$value->as_text,"\n";
>           
>           # it also defines a hash_tree method, which allows data orientated
>           # access into this object
>           $hash = $value->hash_tree();
>        }
>    } 
>           
> 
> =head1 DESCRIPTION
> 
> Annotation Collections are a way of storing a series of "interesting
> facts" about something. We call an "interesting fact" in Bioperl an
> Annotation (this differs from a Sequence Feature, which is called
> a Sequence Feature and may or may not have an Annotation Collection).
> 
> The trouble with this is that we are not that sure what "interesting
> facts" someone might want to store: the possibilities are endless.
> 
> Bioperl's approach is that the "interesting facts" are represented by
> Bio::AnnotationI objects. The interface Bio::AnnotationI guarantees
> two methods
> 
>    $obj->as_text(); # string formatted to display to users
> 
> and
> 
>    $obj->hash_tree(); # hash with defined rules for data-orientated discovery
> 
> The hash_tree method is designed to play well with XML output and
> other "nested-tag-of-data-values" formats (think BoulderIO and/or Ace).
> For more info read the Bio::AnnotationI docs.
> 
> Annotations are stored in AnnotationCollections, each Annotation under a
> different "tag". The tags allow simple discovery of the available annotations,
> and in some cases (like the tag "gene_name") indicate how to interpret the
> data underneath the tag. Tagging is only one level deep and each tag can
> have an array of values.
> 
> In addition, AnnotationCollectionI's are guaranteed to maintain a consistent
> set of object values under each tag - at least in that each object complies
> with one interface. The "standard" AnnotationCollection insists that the
> following rules are set up
> 
>   Tag         Object
>   ---         ------
>   reference   Bio::Annotation::Reference
>   comment     Bio::Annotation::Comment
>   dblink      Bio::Annotation::DBLink
>   gene_name   Bio::Annotation::SimpleValue
>   description Bio::Annotation::SimpleValue
> 
> These tags are the implicit tags that the SeqIO system needs to round-trip
> GenBank/EMBL/Swissprot.
> 
> However, you as a user and us collectively as a community can grow the
> "standard" tag mapping over time and specifically for a particular
> area.
> 
> 
> 
> =head1 NAME
> 
> Bio::AnnotationI - Annotation interface
> 
> =head1 SYNOPSIS
> 
>   # generally you get AnnotationI's from AnnotationCollectionI's
> 
>    foreach $key ( $ac->get_all_annotation_keys() ) {
>        @values = $ac->get_Annotations($key);
>        foreach $value ( @values ) {
>           # value is a Bio::AnnotationI, and defines an "as_text" method
>           print "Annotation ",$key," stringified value ",$value->as_text,"\n";
>           # you can also use a generic hash_tree method for getting 
>           # stuff out say into XML format
>           $hash_tree = $value->hash_tree();
>        }
>    } 
> 
> 
> =head1 DESCRIPTION
> 
> This is the interface all annotations must support. There are two things
> that each annotation has to support.
> 
>   $annotation->as_text()
> 
> Annotations have to support an "as_text" method. This should return a
> single text string, without newlines, representing the annotation,
> mainly for human readability. It is not aimed at being able to
> store/represent the annotation.
> 
> The second method allows annotations to at least attempt to represent
> themselves as pure data for storage/display/whatever. The method
> hash_tree
> 
>    $hash = $annotation->hash_tree();
> 
> should return an anonymous hash with "XML-like" formatting. The
> formatting is as follows.
> 
>   (1) For each key in the hash, if the value is a reference to an array -
> 
>       (2) For each element of the array, if the value is an object -
>           assume the object has the method "hash_tree";
>       (3) else if the value is a reference to a hash -
>           recurse again from point (1);
>       (4) else
>           assume the value is a scalar, and handle it directly as text.
> 
>   (5) else (if not an array) apply rules 2, 3 and 4 to the value.
> 
> The XML path of tags is represented by the keys taken through the
> hashes. When an array is encountered, all of its elements sit at the
> path level of that key's tag.
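
The five rules above can be sketched as a small recursive walker. This is
only my reading of them, not the committed code: tree_to_xml, value_to_xml,
xml_escape and the My::Comment class are hypothetical names, and I sort the
keys purely to get stable output.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Escape the characters that are unsafe in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Rules 2-4: render a single value found under $tag.
sub value_to_xml {
    my ($tag, $value) = @_;
    my $body;
    if (blessed($value) && $value->can('hash_tree')) {    # rule 2: object
        $body = tree_to_xml($value->hash_tree);
    }
    elsif (ref($value) eq 'HASH') {                        # rule 3: nested hash
        $body = tree_to_xml($value);
    }
    else {                                                 # rule 4: plain scalar
        $body = xml_escape($value);
    }
    return "<$tag>$body</$tag>";
}

# Rules 1 and 5: an array contributes one element per entry at the same
# path level; anything else is treated as a single value.
sub tree_to_xml {
    my ($tree) = @_;
    my $xml = '';
    for my $key (sort keys %$tree) {
        my $value  = $tree->{$key};
        my @values = ref($value) eq 'ARRAY' ? @$value : ($value);
        $xml .= value_to_xml($key, $_) for @values;
    }
    return $xml;
}

# Hypothetical stand-in for an object that provides hash_tree.
package My::Comment;
sub new       { my ($class, $text) = @_; return bless { text => $text }, $class }
sub hash_tree { my ($self) = @_; return { text => $self->{text} } }

package main;

my $tree = {
    title   => 'A & B',
    author  => [ 'Smith', 'Jones' ],
    comment => My::Comment->new('checked'),
};
print tree_to_xml($tree), "\n";
```

Note the array under "author" comes out as two sibling author elements
rather than a wrapper element, which is how I read the "same path level"
remark.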
> 
> This is a pretty "natural" representation of an object tree in an XML
> style, without forcing everything to inherit from some super-generic
> interface for representing things in the hash.
> 