[Bioperl-l] Structured (nested) Annotation

Hilmar Lapp hlapp@gnf.org
Mon, 7 Oct 2002 15:25:21 -0700


> -----Original Message-----
> From: Chris Mungall [mailto:cjm@fruitfly.org]
> Sent: Monday, October 07, 2002 1:07 PM
> To: Hilmar Lapp; bioperl-l@bioperl.org
> Subject: RE: [Bioperl-l] Structured (nested) Annotation
> 
> 
> 
> Hmm, I have a few problems with this thread - I'm always a little uneasy
> with the idea that everything has to fit into some perfect uber object
> model, and anything that diverges from the Perfect Model is Obviously
> Wrong.
> 
> Ok, I'll admit the practice of using the same sequence entity for multiple
> proteins in multiple species seems a little off - if you want to annotate
> the sequence with post-translational modifications, can you be sure the
> annotations are true across all species?

There are more problems than that: gene names, references, and cross-references (dblinks) in many cases pertain to only one species, or to a certain subset of the species. Trying to capture this in the annotation makes the view not simpler but pretty complex. SP does this for references: each one has a species (yes, references have a species as an attribute). Not trying to capture it, which is what SP does for dblinks and gene names, dismisses annotation structure that may be important to quite a number of people.

> 
> But then swissprot may have good reasons for collapsing identical
> sequences into the same entity - ease of database management for one
> thing.
> 
> Besides, ensembl does a similar thing with alternate spliceforms producing
> proteins with identical sequences - these are collapsed into the same
> entity, even though they are distinct proteins, possibly with distinct
> cellular localisation, post translational modifications etc.

Same entity of what? Same transcript? Two transcripts which happen to result in the same protein sequence get the same identifier (are the same thing)? 

If that's true, I would care ...

> 
> Most ensembl users don't care, so therefore ensembl is correct to do this.
> But the point is, there are always multiple ways of viewing the data. A
> view with everything disentangled will be too big, you're always going to
> have to collapse some entities (eg protein and protein sequence). There's
> no one correct way of doing it.
> 
> Another thing - has anyone considered instead of the Bio::Annotation
> object just attaching a lightweight xml structure to the seq/feature?

Not yet, but the Annotation::StructuredValue is lightweight too. Do you have code that showcases this? As I said, I'm happy to adopt a better solution if it does at least what mine does.

> this could be a simple nested array. you could use standard ways of
> querying/transforming this. I've used this pattern in the past, it's nice.
> Strict model where you need it, loose/extensible where you want it.
> 
> Ok, I will admit this is a bit daft:
> GN   (CALM1 OR CAM1 OR CALM OR CAM) AND (CALM2 OR CAM2 OR CAMB) AND
> GN   (CALM3 OR CAM3 OR CAMC).
> 
> But I'm sure you can get away with turning it into a flat list

That's exactly what I want to avoid. As I said, I'm currently flattening this anyway before it's dumped into biosql, but it's not the parser that does the flattening. We will want to use the nested structure as a source for computing synonym relationships between gene names.
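
To make the point concrete, here is a minimal sketch of the idea in Python (not BioPerl code, and not the actual parser). It assumes that in the GN line above OR separates synonyms of one gene and AND separates distinct genes. The function names parse_gene_names, flatten and synonym_pairs are hypothetical and exist only for this illustration. The parser keeps the nesting, flattening is a separate later step, and synonym pairs can only be derived while the groups are still intact.

```python
import re
from itertools import combinations

# The GN line quoted above, as it appears in the entry.
GN_TEXT = """\
GN   (CALM1 OR CAM1 OR CALM OR CAM) AND (CALM2 OR CAM2 OR CAMB) AND
GN   (CALM3 OR CAM3 OR CAMC).
"""


def parse_gene_names(gn_text):
    """Parse GN lines into a nested list: one inner list per gene.

    Assumption: AND separates genes, OR separates synonyms of one gene.
    """
    # Drop the "GN" line tags and join continuation lines into one string.
    lines = [re.sub(r"^GN\s+", "", ln.strip())
             for ln in gn_text.splitlines() if ln.strip()]
    body = " ".join(lines).rstrip(".").strip()

    groups = []
    for part in re.split(r"\s+AND\s+", body):
        part = part.strip()
        # Remove the parentheses that wrap a group of synonyms, if present.
        if part.startswith("(") and part.endswith(")"):
            part = part[1:-1]
        groups.append([name.strip() for name in re.split(r"\s+OR\s+", part)])
    return groups


def flatten(groups):
    """Collapse the nested structure into a flat list (loses the grouping)."""
    return [name for group in groups for name in group]


def synonym_pairs(groups):
    """Return every pair of names that belong to the same gene."""
    return [pair for group in groups for pair in combinations(group, 2)]


if __name__ == "__main__":
    groups = parse_gene_names(GN_TEXT)
    print(groups)
    print(flatten(groups))
    print(synonym_pairs(groups))
```

Once the list is flat there is no way to tell that CALM1 and CAM1 are synonyms while CALM1 and CALM2 are not, which is why the flattening should not happen in the parser.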

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------