[Bioperl-l] Annotation structure

Ewan Birney birney@ebi.ac.uk
Thu, 2 Aug 2001 22:57:14 +0100 (BST)


[cc'ing matt and thomas in because I want to understand their design
decision in biojava]



As mentioned at BOSC, I want to overhaul the annotation
structure. Currently we have the rather crappy 

  $seq->annotation(); # gives an annotation object

  $annotation->each_Reference(); # list of references
  $annotation->each_Comment();   # list of comments
  $annotation->each_DBLink();    # list of dblinks

This is very much "what you need to store for round tripping
genbank-embl". It is not focused at all on extensibility.


The proposal is to head towards more of a generic tag => list of values
scheme which will (a) extend better and (b) play much better with biojava
and biocorba. My current proposal is this:

Bio::Annotation moves to Bio::AnnotationCollection. 

Bio::AnnotableI (direct copy from biojava) defines the method

   $obj->annotation(); 

which gives back a Bio::AnnotationCollection

Bio::AnnotationCollection is:


=head1 NAME

Bio::AnnotationCollectionI - Interface for annotation collections

=head1 SYNOPSIS

   # get an AnnotationCollectionI somehow, eg

   $ac = $seq->annotation();

   foreach $key ( $ac->get_all_annotation_keys() ) {
       @values = $ac->get_Annotations($key);
       foreach $value ( @values ) {
          # value is a Bio::AnnotationI, and defines a "string" method
          print "Annotation ",$key," stringified value ",
             $value->string,"\n";
       }
   } 
          

=head1 DESCRIPTION

Interface for a collection of annotations

=head1 FEEDBACK
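
To make the SYNOPSIS above concrete, here is a minimal in-memory sketch of
such a collection. It is only an illustration under stated assumptions: the
package names My::SimpleValue and My::AnnotationCollection are made up, and
add_Annotation is an assumed setter name (the proposal above only names
get_all_annotation_keys, get_Annotations and string). The stored values are
placeholders.

```perl
use strict;
use warnings;

# Hypothetical minimal annotation value: it only promises a string() method.
{
    package My::SimpleValue;
    sub new    { my ($class, $text) = @_; return bless { text => $text }, $class }
    sub string { return $_[0]->{text} }
}

# Hypothetical in-memory collection: tag => list of annotation objects.
{
    package My::AnnotationCollection;
    sub new { my ($class) = @_; return bless { store => {} }, $class }

    # Assumed setter name; not part of the interface as proposed.
    sub add_Annotation {
        my ($self, $key, $value) = @_;
        push @{ $self->{store}{$key} }, $value;
        return $self;
    }

    # All tags that currently have at least one value, in sorted order.
    sub get_all_annotation_keys {
        my ($self) = @_;
        my @keys = sort keys %{ $self->{store} };
        return @keys;
    }

    # All values stored under one tag (empty list if the tag is unknown).
    sub get_Annotations {
        my ($self, $key) = @_;
        return @{ $self->{store}{$key} || [] };
    }
}

my $ac = My::AnnotationCollection->new;
$ac->add_Annotation(comment => My::SimpleValue->new('first comment'));
$ac->add_Annotation(comment => My::SimpleValue->new('second comment'));
$ac->add_Annotation(dblink  => My::SimpleValue->new('EXAMPLE:0001'));

foreach my $key ($ac->get_all_annotation_keys) {
    foreach my $value ($ac->get_Annotations($key)) {
        print "Annotation $key stringified value ", $value->string, "\n";
    }
}
```

A generic client only ever needs the key list and ->string here, which is
the point of the tag => list of values scheme.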



All well and good, but here is the list of design decisions we get into
(for better or worse)...



(a) I always feel we have one class too many here - I sort of want to
remove AnnotableI and make Seq inherit from AnnotationCollectionI. But
this is the way biojava does it (which may well be due to how we did it in
the first place). This relates to (c) below.


(b) We've got to have some agreed set of "standard" keys, like

   reference, dblink, comment 

etc to agree on. That's ok - that's what you live with for extensibility,
but there is an argument that you might want something more hierarchical
such that

    @objects = $ann->get_Annotations("geneticdisease")

would give you back Bio::Something::Disease::Genetic but

    @objects = $ann->get_Annotations("disease")

gives back the superset. Some hierarchical type system (centrally
controlled?) controls the standard. (Good? Bad?)


After thinking about this I don't like it - it is asking for quite a heavy
system behind the scenes (not so heavy, but heavy enough) to manage this
and will make implementing other objects behind this interface
tough. Hmph.
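
For a feel of what that behind-the-scenes system involves, here is a rough
sketch of a hierarchical lookup. Everything in it is an assumption made for
illustration: the %parent_of registry, the helper names is_a_key and
get_annotations_hier, and the placeholder values, which are plain strings
rather than annotation objects to keep it short.

```perl
use strict;
use warnings;

# Hypothetical central registry: child key => parent key.
my %parent_of = (
    geneticdisease    => 'disease',
    infectiousdisease => 'disease',
);

# Placeholder store; values are plain strings in this sketch.
my %store = (
    disease           => ['unclassified disorder'],
    geneticdisease    => ['example genetic disorder'],
    infectiousdisease => ['example infection'],
    comment           => ['unrelated'],
);

# True if $key is $ancestor itself or descends from it in the registry.
sub is_a_key {
    my ($key, $ancestor) = @_;
    while (defined $key) {
        return 1 if $key eq $ancestor;
        $key = $parent_of{$key};
    }
    return 0;
}

# Values stored under $wanted plus those under every descendant key.
sub get_annotations_hier {
    my ($wanted) = @_;
    return map  { @{ $store{$_} } }
           grep { is_a_key($_, $wanted) }
           sort keys %store;
}

print join(', ', get_annotations_hier('disease')), "\n";
print join(', ', get_annotations_hier('geneticdisease')), "\n";
```

Even this toy version needs a shared registry that every implementation
behind the interface would have to consult on each lookup, which is the
weight being worried about.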


In general, if we do set a standard set of tags, to what extent should we
enforce the tag-->object mapping? I'm leaning towards relatively strictly
enforcing it with a hash in AnnotationCollectionI being something like


%tag_object_map = (
    'reference' => 'Bio::Annotation::ReferenceI',
    'dblink'    => 'Bio::Annotation::DBLinkI',
    'comment'   => 'Bio::Annotation::CommentI' );


with the idea that implementations enforce these rules on their annotation
collections.
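
As a sketch of what that enforcement could look like in an implementation:
the add_Annotation method name and all the My::* packages below are
assumptions standing in for the real interfaces, and letting non-standard
tags through unchecked is just one possible policy.

```perl
use strict;
use warnings;

# Stand-ins for the interface classes named in the map.
{ package My::Annotation::ReferenceI; }
{ package My::Annotation::CommentI; }

# Hypothetical concrete comment that implements the comment interface.
{
    package My::Comment;
    our @ISA = ('My::Annotation::CommentI');
    sub new    { my ($class, $text) = @_; return bless { text => $text }, $class }
    sub string { return $_[0]->{text} }
}

{
    package My::StrictCollection;
    use Scalar::Util qw(blessed);

    my %tag_object_map = (
        reference => 'My::Annotation::ReferenceI',
        comment   => 'My::Annotation::CommentI',
    );

    sub new { my ($class) = @_; return bless { store => {} }, $class }

    # Assumed setter: dies if a standard tag gets the wrong kind of object.
    sub add_Annotation {
        my ($self, $tag, $value) = @_;
        if (my $required = $tag_object_map{$tag}) {
            die "Annotation for tag '$tag' must be a $required\n"
                unless blessed($value) && $value->isa($required);
        }
        push @{ $self->{store}{$tag} }, $value;
        return $self;
    }

    sub get_Annotations {
        my ($self, $tag) = @_;
        return @{ $self->{store}{$tag} || [] };
    }
}

my $ac = My::StrictCollection->new;
$ac->add_Annotation(comment => My::Comment->new('looks fine'));

# A comment filed under 'reference' violates the map and is rejected.
my $ok  = eval { $ac->add_Annotation(reference => My::Comment->new('wrong slot')); 1 };
my $err = $ok ? '' : $@;
print "rejected: $err" unless $ok;

# Tags outside the standard set are left unchecked in this sketch.
$ac->add_Annotation(my_lab_note => My::Comment->new('free-form'));
```

The check happens on the way in, so a wrongly typed value fails at the point
it is added rather than later in some client that reads the collection.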



(c) Biojava and Biocorba reuse the annotation interface for the tag-value
qualifiers off features and (therefore) have the same extensibility of
their annotations. I've always been against this because what it has to
store is very often just strings, but I think I have been a bit of a
luddite here: the killer use case is a gene seqfeature which should have
as rich an annotation - and as extensible - as a sequence.


The problem here is that I want to keep backward compatibility with the
current has_tag_value, each_tag_value system on SeqFeatureI, reusing the
AnnotationI ->string method to allow these to be put in. This means I want

  SeqFeatureI to inherit from AnnotationCollectionI

This is different from biojava, which I believe has the SeqFeatureI
equivalent inheriting from Annotable and so having a separate annotation
call to get the annotationcollection object. To make sure seqfeatures can
maintain the old has_tag_value etc there would be some somewhat ugly
delegation (I guess not so bad) out to this annotation object.
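
A rough sketch of that delegation, assuming the feature holds a collection
object and forwards the old tag-value calls to it. The My::* packages are
hypothetical, and add_tag_value and add_Annotation are assumed names that
the text above does not spell out.

```perl
use strict;
use warnings;

# Hypothetical plain-string annotation, reachable through ->string.
{
    package My::TagValue;
    sub new    { my ($class, $text) = @_; return bless { text => $text }, $class }
    sub string { return $_[0]->{text} }
}

# Hypothetical collection, kept as small as possible.
{
    package My::Collection;
    sub new { my ($class) = @_; return bless { store => {} }, $class }
    sub add_Annotation {
        my ($self, $tag, $value) = @_;
        push @{ $self->{store}{$tag} }, $value;
        return $self;
    }
    sub get_Annotations {
        my ($self, $tag) = @_;
        return @{ $self->{store}{$tag} || [] };
    }
}

# Hypothetical feature: one collection per feature, built in the constructor.
{
    package My::Feature;
    sub new {
        my ($class) = @_;
        return bless { annotation => My::Collection->new }, $class;
    }
    sub annotation { return $_[0]->{annotation} }

    # Old-style calls, kept for compatibility and forwarded to the collection.
    sub add_tag_value {
        my ($self, $tag, $text) = @_;
        $self->annotation->add_Annotation($tag, My::TagValue->new($text));
        return;
    }
    sub has_tag_value {
        my ($self, $tag) = @_;
        my @values = $self->annotation->get_Annotations($tag);
        return @values ? 1 : 0;
    }
    sub each_tag_value {
        my ($self, $tag) = @_;
        return map { $_->string } $self->annotation->get_Annotations($tag);
    }
}

my $feat = My::Feature->new;
$feat->add_tag_value(gene => 'example_gene');
$feat->add_tag_value(note => 'first note');
$feat->add_tag_value(note => 'second note');

print "has note\n" if $feat->has_tag_value('note');
print join('; ', $feat->each_tag_value('note')), "\n";
```

Old callers keep seeing plain strings, while new callers can go through
->annotation and get the objects themselves.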


This will make SeqFeature::Generic much heavier if we have to build a
Bio::AnnotationCollection object for each SeqFeature::Generic and this is
bad news as we make millions of SeqFeature::Generic's...



So - question for Matt/Thomas - why do you split out AnnotableI and
AnnotationCollectionI in biojava - what is the win?



(d) AnnotationI, Serialisation.


What should AnnotationI support?

By supporting ->string we at least allow every client to display
*something* for an annotation. I think this is important,
otherwise things like SeqCanvas won't be able to show anything. (bad).


I am sorely tempted to try to build other, richer serialisation standards
in here. This would be sort of like the to_FTHelper system for sequence
features but perhaps something XML-like. Something like

# might not be good for large objects 
$xml_string = $annotation->to_XML

or

# painful for getting it back to a string, could use IO::String
$stream = \*STDOUT;
$annotation->write_XML($stream)
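
The two calls need not be either/or. Here is a sketch in which to_XML is
built on top of write_XML using an in-memory filehandle (core Perl from 5.8
on; IO::String would do the same job on older perls). The My::Comment
package, the <comment> element name and the _escape helper are all made up
for illustration and are not a proposed format.

```perl
use strict;
use warnings;

{
    package My::Comment;
    sub new    { my ($class, $text) = @_; return bless { text => $text }, $class }
    sub string { return $_[0]->{text} }

    # Minimal escaping of the characters that are special in XML text.
    sub _escape {
        my ($text) = @_;
        $text =~ s/&/&amp;/g;
        $text =~ s/</&lt;/g;
        $text =~ s/>/&gt;/g;
        return $text;
    }

    # Stream form: write straight to whatever filehandle the caller gives us.
    sub write_XML {
        my ($self, $fh) = @_;
        print {$fh} '<comment>', _escape($self->{text}), "</comment>\n";
        return;
    }

    # String form: reuse write_XML by pointing it at an in-memory handle.
    sub to_XML {
        my ($self) = @_;
        my $buffer = '';
        open(my $fh, '>', \$buffer) or die "Cannot open in-memory handle: $!";
        $self->write_XML($fh);
        close($fh);
        return $buffer;
    }
}

my $comment = My::Comment->new('binds DNA & RNA');

$comment->write_XML(\*STDOUT);        # streaming
my $xml_string = $comment->to_XML;    # string, fine for small objects
```

With this arrangement only write_XML has to be implemented per annotation
class, and large objects can still be streamed.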



What do people think here? Useful? I suspect putting something like this
in is good. I dislike exposing the entire tag-value system inherent in XML
in a DOM-like way as it encourages crazy traversals of the tree of values
and just kills performance.


If we do an XML serialisation, should we abstract out
GenBank/EMBL/Swissprot header section writing in a sort of
($line-tag-meta-type,@list-of-lines) data structure to allow extensibility
of writing these formats? Or - as the formats are standardised, perhaps we
should leave this to standardised objects. Hmmmmm.



Do we need a basis object (experimental/computational/reference) and if
so, what should it look like?




Do people have opinions on
this? Jason/Hilmar/Heikki/Matt/Thomas/Mark+David are the people I am most
interested in hearing from. Key questions:


  (a) rigid biojava/biocorba cribbing, or removing this AnnotableI
interface? (I favour removing)
 
  (b) type enforcement of standard types (I like enforcement - it will
catch otherwise weird looking bugs)

  (c) type hierarchy or flat (I favour flat)

  (d) XML serialisation and how to do it (I think it is a good thing, no
clear ideas how to do it. Has someone done this before and got stories to
tell? I should bond with Lincoln about Boulder next week)

  (e) basis object - Chris Mungall probably has one for GO which I should
crib



anyone else of course - feel free to chime in



I have the nasty feeling that one of the decisions we make we will regret.
I wonder which one!




ewan





-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>. 
-----------------------------------------------------------------