[Bioperl-l] Structured (nested) Annotation

Ewan Birney birney@ebi.ac.uk
Tue, 8 Oct 2002 07:53:11 +0100 (BST)


On Mon, 7 Oct 2002, Chris Mungall wrote:

>
> Hmm, I have a few problems with this thread - I'm always a little uneasy
> with the idea that everything has to fit into some perfect uber object
> model, and anything that diverges from the Perfect Model is Obviously
> Wrong.
>
> Ok, I'll admit the practice of using the same sequence entity for multiple
> proteins in multiple species seems a little off - if you want to annotate
> the sequence with post-translational modifications, can you be sure the
> annotations are true across all species?
>
> But then swissprot may have good reasons for collapsing identical
> sequences into the same entity - ease of database management for one
> thing.
>
> Besides, ensembl does a similar thing with alternate spliceforms producing
> proteins with identical sequences - these are collapsed into the same
> entity, even though they are distinct proteins, possibly with distinct
> cellular localisation, post translational modifications etc.

Well - actually, ensembl doesn't collapse that case (two transcripts
with different UTRs currently get different ENSPs - they could get the
same one; it's our choice).

We - like everyone else - do not even attempt to track post-translational
modification sets with unique ids (does flybase?). Swissprot tracks just
the modifications, which is possibly the only sane way to do this.

>
> Most ensembl users don't care, so therefore ensembl is correct to do this.
> But the point is, there are always multiple ways of viewing the data. A
> view with everything disentangled will be too big, you're always going to
> have to collapse some entities (eg protein and protein sequence). There's
> no one correct way of doing it.
>

Yes - like many parts of software engineering, there are just "decisions
with consequences", not "right or wrong answers". You should have seen the
nice letter I sent to swissprot, which starts off with the words "I know
there are good reasons to do this".


BTW - I think the driving force for this is not that swissprot thinks it
is a good way to represent the data, but that they want to minimise
redundant annotation time - very sane. However, that is something you
could solve in software while still tracking species vs entries cleanly.


I'm going to stick to my guns and claim that a one-to-one mapping between
entry and species (and ... yes ... of course there are honest-to-god
chimeric sequences which won't work here, but consider that to be a
special, special case) is the call with the least damaging consequences
**GIVEN** that you can write a software tool that allows "trivially
redundant" annotation to be propagated across close orthologues with no
effort.
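
To make the propagation idea concrete, here is a minimal sketch of what
such a tool could look like. It is in Python rather than Perl purely for
brevity, and everything in it is an assumption for illustration: the Entry
class, the propagate function, the made-up accessions and toy sequences,
and in particular the use of "identical sequence" as a stand-in for "close
orthologue". It is not how ensembl, swissprot or bioperl do anything.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One database entry; exactly one species per entry (hypothetical model)."""
    accession: str
    species: str
    sequence: str
    annotations: dict = field(default_factory=dict)


def propagate(source, targets, keys):
    """Copy the chosen annotation keys from source to matching targets.

    A target matches when its sequence is identical to the source's
    (an assumed, deliberately crude test for "trivially redundant").
    Existing annotations on a target are never overwritten.
    Returns the accessions of the entries that received something.
    """
    updated = []
    for target in targets:
        if target is source or target.sequence != source.sequence:
            continue
        copied = False
        for key in keys:
            if key in source.annotations and key not in target.annotations:
                target.annotations[key] = source.annotations[key]
                copied = True
        if copied:
            updated.append(target.accession)
    return updated


# Hypothetical entries: same toy sequence in two species, different in a third.
human = Entry("X_HUMAN", "Homo sapiens", "MADQLTEEQ",
              {"function": "binds calcium"})
mouse = Entry("X_MOUSE", "Mus musculus", "MADQLTEEQ")
yeast = Entry("X_YEAST", "Saccharomyces cerevisiae", "MSSNLTEEQ")

print(propagate(human, [human, mouse, yeast], ["function"]))
```

Each species keeps its own entry, so a species-specific annotation can
still be added to one entry without touching the others; only the keys you
ask for are copied.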




> Another thing - has anyone considered instead of the Bio::Annotation
> object just attaching a lightweight xml structure to the seq/feature? this
> could be a simple nested array. you could use standard ways of
> querying/transforming this. I've used this pattern in the past, it's nice.
> Strict model where you need it, loose/extensible where you want it.
>
> Ok, I will admit this is a bit daft:
> GN   (CALM1 OR CAM1 OR CALM OR CAM) AND (CALM2 OR CAM2 OR CAMB) AND
> GN   (CALM3 OR CAM3 OR CAMC).
>
> But I'm sure you can get away with turning it into a flat list - gene
> symbols are generally a bit of a nightmare anyway and that's no sp's
> fault.


You're bang right there. I will forward the email I sent to the swissprot
guys so you know where I am coming from!
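
For what it's worth, here is a rough sketch of both views of that GN line:
the nested array Chris describes and the flat list. It is Python rather
than Perl for brevity, and the function names (join_gn_lines, parse_gn,
flatten) are made up for this example, not part of bioperl. It assumes the
line has exactly the shape quoted above, i.e. parenthesised OR groups
joined by AND with a trailing full stop, and nothing more general.

```python
import re


def join_gn_lines(lines):
    """Strip the 'GN' line code and join continuation lines into one string."""
    return " ".join(line[2:].strip() for line in lines if line.startswith("GN"))


def parse_gn(text):
    """Turn '(A OR B) AND (C OR D).' into the nested list [[A, B], [C, D]]."""
    text = text.strip().rstrip(".")
    groups = []
    for group in re.split(r"\s+AND\s+", text):
        group = group.strip().strip("()")
        groups.append([name.strip() for name in re.split(r"\s+OR\s+", group)])
    return groups


def flatten(groups):
    """Collapse the nested structure into a flat list of symbols."""
    return [name for group in groups for name in group]


gn_lines = [
    "GN   (CALM1 OR CAM1 OR CALM OR CAM) AND (CALM2 OR CAM2 OR CAMB) AND",
    "GN   (CALM3 OR CAM3 OR CAMC).",
]

groups = parse_gn(join_gn_lines(gn_lines))
print(groups)           # one inner list of synonyms per gene
print(flatten(groups))  # the flat list of all symbols
```

The nested form keeps the "synonyms of one gene" grouping for anyone who
needs it; the flat form is what most callers will want.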

