[Biocorba-l] Biocorba IDL - Clarifications

Matthew Pocock mrp@sanger.ac.uk
Fri, 20 Oct 2000 18:43:51 +0100


Ewan Birney wrote:

> On Fri, 20 Oct 2000, Alan Robinson wrote:
>
> > Hi all,
> >
> > Could someone clarify the correct behaviour for the following situations
> > for me, please?
> >
> > I have an EMBL sequence with the following feature and location:
> >
> >   FT mRNA join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
> >
> > What should be the SeqFeature returned for this?
> >
> > Currently, I simply generate and return:
> >
> >       type = mRNA
> >       source = <none>
> >       seq_primary_id = X02158
> > *     start = 397
> > *     end = 3327
> >       strand = +1
> >       qualifiers = <none>
> >       PrimarySeq_is_available = true
> >
> > It's the start and end values I'm concerned with. The current SeqFeature
> > is rather minimal (not necessarily a bad thing!). But, I cannot see any
> > way to specify the multiple locations of this feature cleanly in
> > SeqFeature.
> >
>
> Right. We punted on this one previously. But we have to solve it.
>
> (this is one of the 'unsolved design problems in bioinformatics', in my
> view)
>
> options:
>
> (a) have an extended "range" object that is a composite of simple ranges,
> i.e...
>
>    interface Range {
>       attribute long start;   // IDL has no "int"; long is the 32-bit integer
>       attribute long end;
>       attribute long strand;
>    };
>
>    interface CompositeRange : Range {
>       sequence<Range> sub_Range();
>    };
>

...or alternatively features have a CompositeRange which must contain one or more
Range objects. As (a) stands, it allows arbitrary nesting, which will cause
problems at some point. A CompositeRange can easily be passed around as an array
of Range objects, which will be very cheap. ?BioJava Location interface?
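
A rough sketch of the flat variant, in Python purely for illustration (none of
this is biocorba IDL). The class names reuse Range and CompositeRange from
option (a); parse_join is a hypothetical helper that only understands the plain
join(a..b,c..d) form from Alan's example, not the full EMBL location syntax.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Range:
    """A simple contiguous range: the start/end/strand triple from option (a)."""
    start: int
    end: int
    strand: int = 1


@dataclass(frozen=True)
class CompositeRange:
    """A flat, non-empty collection of simple ranges (illustrative only)."""
    sub_ranges: Tuple[Range, ...]

    def __post_init__(self):
        if not self.sub_ranges:
            raise ValueError("a CompositeRange needs at least one Range")
        # Only plain Range objects are accepted, so composites cannot nest.
        if not all(type(r) is Range for r in self.sub_ranges):
            raise TypeError("sub-ranges must be simple Range objects")

    @property
    def start(self) -> int:
        # Overall extent is derived from the parts, not stored separately.
        return min(r.start for r in self.sub_ranges)

    @property
    def end(self) -> int:
        return max(r.end for r in self.sub_ranges)


def parse_join(location: str, strand: int = 1) -> CompositeRange:
    """Hypothetical helper: handles only 'join(a..b,c..d,...)' or a bare 'a..b'."""
    inner = location.strip()
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[len("join("):-1]
    spans = []
    for part in inner.split(","):
        lo, hi = part.split("..")
        spans.append(Range(int(lo), int(hi), strand))
    return CompositeRange(tuple(spans))


mrna = parse_join("join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)")
```

Here mrna.start and mrna.end give the 397 and 3327 that Alan currently returns,
while mrna.sub_ranges keeps the five individual spans as a flat array.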

>
> Sequence Features have-a Range that could be either a CompositeRange or a
> Range
>
>    -- benefits - close to original embl/genbank ft model
>                - close to biojava
>
>    -- drawbacks - really mini-objects here - could this be a struct?
>                 - not close to bioperl
>                 - annoys ewan because it allows people to overload
> semantics into features without making different objects, eg, like the
> way the CDS line "means" a gene in EMBL Feature table and you have to
> combine it with a mRNA line to get the whole structure with UTRs
>
> (b). Same as above, but reverse the containment rules. I like this because
> with my experience in Ensembl, you want to have many ranges pointing to
> the same sequence feature.
>
> ie, Ranges have-a sequence feature object, and sequences return a list of
> ranges.
>
>     -- benefits - as above
>
>     -- drawbacks - the (a) drawbacks above, plus it makes the feature objects
> pathetically small in some cases, with just "primary_tag" and "source_tag"
>

This makes my head hurt - can you give a concrete example?

>
> (c) let sequence features have sub sequence features
>
>     -- benefits - close to bioperl
>                   fewer objects
>     -- drawbacks - still open to abuse as in (a) and also question about
>                    how you map EMBL feature table lines to
>

I like this one for features in general. Mapping the EMBL feature table lines to
an appropriate nested structure is perhaps more the problem domain of the person
writing the EMBL->biocorba server than of the maintainers of the IDL :-)
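
For what it is worth, a minimal sketch of what such a server-side mapping might
look like for option (c), again in Python for illustration only. The
sub_features field, the mrna_from_join helper and the choice to label each child
"exon" are all assumptions of this sketch, not anything the IDL defines.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SeqFeature:
    """Illustrative feature that may carry nested sub-features, as in option (c)."""
    type: str
    start: int
    end: int
    strand: int = 1
    sub_features: List["SeqFeature"] = field(default_factory=list)


def mrna_from_join(location: str) -> SeqFeature:
    """Hypothetical mapping: one parent mRNA feature with one child per span."""
    if not (location.startswith("join(") and location.endswith(")")):
        raise ValueError("this sketch only handles the plain join(...) form")
    body = location[len("join("):-1]
    children = []
    for part in body.split(","):
        lo, hi = (int(x) for x in part.split(".."))
        # Calling each child an "exon" is the server writer's decision.
        children.append(SeqFeature("exon", lo, hi))
    # The parent spans everything covered by its children.
    return SeqFeature(
        "mRNA",
        min(c.start for c in children),
        max(c.end for c in children),
        sub_features=children,
    )


feature = mrna_from_join(
    "join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)"
)
```

A client that ignores sub_features still sees the 397 to 3327 extent, and one
that cares can walk the five children.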

>
> (d) Let people extend sequence feature with sub sequence feature or more
> complex locations, ie:
>
> We would put in biocorba.idl a number of "common extensions", with complex
> locations being one of them...
>
>     interface ComplexLocationSeqFeature : SeqFeature {
>         // could discuss how we extend this
>         sequence<SeqFeature> subSequenceFeatures();
>         string embl_location_line(); // for anal parsing
>     };
>
>     -- benefits - using inheritance properly to solve this problem.
> Novel route in bioinformatics surely!
>
>                 - also can extend for fuzziness
>
>     -- drawbacks - none?
>
> I'm going for (d) in this - but the floor is open for other opinions here.
>

I think this would be better named something like CompoundSeqFeature - an
extension of SeqFeature that has zero or more child features, and contains all
of the sequence contained by its children. ComplexLocationSeqFeature pushes the
view that the entire feature is polymorphic, whereas it is really the location
property that has multiple implementations. There are lots of other ways that
the feature blob could be extended - extra fields, clever attached annotation
etc., and it would be a shame to get a multiple-inheritance all-against-all mess
when composition-delegation would let us mix-and-match each variable part more
easily.
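
To make the composition-delegation point concrete, here is a small sketch in
Python (illustration only; the Location protocol, SimpleLocation and
CompoundLocation are hypothetical names, not part of the biocorba IDL). The
feature class stays a single type and delegates to whatever location object it
holds, so the location can vary without subclassing the feature.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol, Tuple


class Location(Protocol):
    """Anything that can report its spans as (start, end) pairs."""
    def spans(self) -> List[Tuple[int, int]]: ...


@dataclass(frozen=True)
class SimpleLocation:
    start: int
    end: int

    def spans(self) -> List[Tuple[int, int]]:
        return [(self.start, self.end)]


@dataclass(frozen=True)
class CompoundLocation:
    parts: Tuple[SimpleLocation, ...]

    def spans(self) -> List[Tuple[int, int]]:
        return [(p.start, p.end) for p in self.parts]


@dataclass
class SeqFeature:
    """One feature class; the variable parts are held, not inherited."""
    type: str
    location: Location
    qualifiers: Dict[str, str] = field(default_factory=dict)

    @property
    def start(self) -> int:
        # Delegates to the location, whichever implementation it is.
        return min(s for s, _ in self.location.spans())

    @property
    def end(self) -> int:
        return max(e for _, e in self.location.spans())


simple = SeqFeature("exon", SimpleLocation(397, 627))
compound = SeqFeature(
    "mRNA",
    CompoundLocation((SimpleLocation(397, 627), SimpleLocation(1194, 1339))),
    qualifiers={"note": "first two spans of the example join"},
)
```

Adding another variable part (say, a richer annotation object) would mean one
more held field rather than another axis of subclasses.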

>
> Whaddya reckon?
>
> > If I were to define each location separately, I cannot see how to describe
> > their implicit relationship under the current IDL without some potentially
> > hairy naming convention or qualifiers that stresses the life of the client
> > writer.
> >
> > Alternatively, one could return the location string via a new method, or
> > in the qualifier.