[Bioperl-l] Structured (nested) Annotation

Hilmar Lapp hlapp@gnf.org
Sun, 6 Oct 2002 23:21:22 -0700


I wanted to be able to represent structured, nested annotation (we 
were there a while ago weren't we?). I made two changes/additions 
that can accomplish this, but in different ways.

1) Annotation collections can now be nested, because 
Annotation::Collection now implements AnnotationI.

I added two methods dealing specifically with nested annotation: 
get_all_Annotations() is similar to get_Annotations(), but traverses 
the whole tree of nested collections (if there is no nesting, it 
behaves identically to get_Annotations()). flatten_Annotations() 
turns a nested collection into an un-nested one.
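
To make the intended behaviour concrete, here is a rough sketch of the 
idea in Python. It is a toy model and not the Perl implementation: the 
class and method names are stand-ins chosen for illustration, and the 
choice to merge nested entries under their own keys when flattening 
is an assumption.

```python
# Toy model of nested annotation collections (illustrative only;
# names and details are assumptions, not the BioPerl code).
class SimpleValue:
    def __init__(self, value):
        self.value = value


class Collection:
    """A collection that can itself be stored as an annotation."""

    def __init__(self):
        self._by_key = {}  # key -> list of annotations

    def add_annotation(self, key, ann):
        self._by_key.setdefault(key, []).append(ann)

    def get_annotations(self, key=None):
        """Direct children only; nested collections come back as-is."""
        keys = [key] if key is not None else list(self._by_key)
        return [a for k in keys for a in self._by_key.get(k, [])]

    def get_all_annotations(self, key=None):
        """Walk the whole tree and return the non-collection leaves."""
        found = []
        for k, anns in self._by_key.items():
            for a in anns:
                if isinstance(a, Collection):
                    found.extend(a.get_all_annotations(key))
                elif key is None or k == key:
                    found.append(a)
        return found

    def flatten_annotations(self):
        """Pull nested leaves up to this level; drop the nesting."""
        flat = {}
        self._collect(flat)
        self._by_key = flat

    def _collect(self, flat):
        for k, anns in self._by_key.items():
            for a in anns:
                if isinstance(a, Collection):
                    a._collect(flat)
                else:
                    flat.setdefault(k, []).append(a)


inner = Collection()
inner.add_annotation("comment", SimpleValue("nested"))
outer = Collection()
outer.add_annotation("comment", SimpleValue("top"))
outer.add_annotation("more", inner)
```

In this model, outer.get_annotations("comment") sees only the 
top-level value, whereas outer.get_all_annotations("comment") also 
finds the one inside the nested collection.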

I think it may be a good idea to promote get_all_Annotations() to 
the interface (AnnotationCollectionI). What do people think? 
Ewan/Jason?

2) Nesting entirely through objects is somewhat heavy-weight if all 
you want to nest is simple values. So I added 
Annotation::StructuredValue which inherits from 
Annotation::SimpleValue and can be called as if it were a simple 
value. In addition, there are methods to add simple values in a 
structured, nested way, and control the way the structure is 
flattened into a single string. Also, there is get_all_values() 
returning a flattened array of values.
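
As a rough illustration of the concept, here is a small Python 
sketch. It is not the Perl module: the add_value(path, ...) 
signature, the default join strings and the bracketing rule are 
assumptions made for the example.

```python
# Toy model of a structured value that still behaves like a simple
# value (illustrative only; signatures and defaults are assumptions).
class SimpleValue:
    def __init__(self, value=""):
        self._value = value

    def value(self):
        return self._value


class StructuredValue(SimpleValue):
    def __init__(self):
        super().__init__()
        self._values = []  # nested lists of strings

    def add_value(self, path, *values):
        """Append values to the sub-list addressed by a list of indices.

        Missing levels are created on the way down; indices are
        expected to be used consecutively (0, 1, 2, ...).
        """
        node = self._values
        for i in path:
            while len(node) <= i:
                node.append([])
            if not isinstance(node[i], list):
                node[i] = [node[i]]  # promote a leaf to a sub-list
            node = node[i]
        node.extend(values)

    def get_all_values(self):
        """Return every value as one flat list, ignoring the nesting."""
        out = []

        def walk(node):
            for item in node:
                if isinstance(item, list):
                    walk(item)
                else:
                    out.append(item)

        walk(self._values)
        return out

    def value(self, joins=(" AND ", " OR "), brackets=("(", ")")):
        """Flatten the structure into a single string.

        joins[d] separates the elements at nesting depth d; sub-lists
        with more than one element are wrapped in brackets.
        """
        def render(node, depth):
            join = joins[min(depth, len(joins) - 1)]
            parts = []
            for item in node:
                if isinstance(item, list):
                    text = render(item, depth + 1)
                    if len(item) > 1:
                        text = brackets[0] + text + brackets[1]
                    parts.append(text)
                else:
                    parts.append(item)
            return join.join(parts)

        return render(self._values, 0)


sv = StructuredValue()
sv.add_value([0], "CALM1", "CAM1")
sv.add_value([1], "CALM2")
print(sv.value())           # the structure as one string
print(sv.get_all_values())  # the same values as a flat list
```

Because the sketch subclasses SimpleValue, code that only ever calls 
value() keeps working unchanged, which is the point of the design.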

My starting use case was to somehow retain the structured 
information in swissprot GN lines.

In case you're unfamiliar with swissprot, GN gives the names of the 
genes that give rise to the protein sequence of the entry. Different 
genes are concatenated by ' AND ', whereas synonyms of the same gene 
are concatenated by ' OR '. Both may co-occur in the same GN line, 
in which case parentheses are used to group. An infamous example is 
Calmodulin (many many species for this entry ...):

GN   (CALM1 OR CAM1 OR CALM OR CAM) AND (CALM2 OR CAM2 OR CAMB) AND
GN   (CALM3 OR CAM3 OR CAMC).

The parser screwed this up before because it knew about neither the 
AND join operator nor possible nesting (nor multiple GN lines). I 
fixed all this and changed the parser to now construct an 
Annotation::StructuredValue object for this.
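
For illustration, the grouping in the GN example above can be 
recovered roughly as follows. This is a minimal Python sketch, not 
the parser code: parse_gn is a hypothetical helper, and it assumes 
one level of parentheses and gene names that do not themselves 
contain ' AND ' or ' OR '.

```python
# Sketch of splitting Swiss-Prot GN lines into genes and synonyms
# (hypothetical helper; handles only the simple format shown above).
def parse_gn(lines):
    """Return a list of genes, each a list of synonymous names."""
    # Strip the 'GN' line code and join continuation lines.
    text = " ".join(
        line[2:].strip() for line in lines if line.startswith("GN")
    )
    # The field is terminated by a period.
    text = text.rstrip(".").strip()
    genes = []
    for group in text.split(" AND "):  # different genes
        group = group.strip()
        if group.startswith("(") and group.endswith(")"):
            group = group[1:-1]
        # Synonyms of the same gene.
        genes.append([name.strip() for name in group.split(" OR ")])
    return genes


gn_lines = [
    "GN   (CALM1 OR CAM1 OR CALM OR CAM) AND (CALM2 OR CAM2 OR CAMB) AND",
    "GN   (CALM3 OR CAM3 OR CAMC).",
]
for synonyms in parse_gn(gn_lines):
    print(synonyms)
```

For the Calmodulin example this yields three genes with four, three 
and three names respectively, which is the structure the single 
annotation object is meant to retain.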

NOTE: The consequence is that now you get back only _ONE_ object for 
$seq->annotation->get_Annotations("gene_name"). You need to call 
get_all_values() on that object to see all the names. Calling 
value() will return the structured array flattened into a single 
string.

	-hilmar

--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------