[Biocorba-l] SeqFeatures and the EMBL IDL
Alan Robinson
alan@ebi.ac.uk
Sat, 21 Oct 2000 02:59:43 +0100 (GMT Daylight Time)
OK - I'm going to have a go at describing how the EMBL IDL handles
Features and I'll show some example Java code of how it is actually used.
This is IDL with a working CORBA server and can handle all the complexity
of EMBL Feature Locations. I neither designed nor wrote it, but I use it
routinely now (see http://corba.ebi.ac.uk/).
-- benefits - We know it handles all the complexity of EMBL feature
locations and has a working CORBA server. Streamlines
easily for simple Feature Locations.
-- drawbacks - It involves using unions and type codes. Client writers
may have to use recursion for processing very complicated
locations. Server writers generating the objects from
flatfiles will have their work cut out.
The IDL and its implementation can handle situations as trivial as:
FT exon 1..100 and complement(1..100, 150..200);
Or as complex as:
join(complement(1..50), one-of(201..220,219), complement(301..<350));
The EMBL IDL uses type codes and unions and is based on the assumption
that no single object could describe all the acrobatics that happen in a
Feature Location.
The basic concept is that the Location is composed of a series of
LocationNodes of different types that form a tree (e.g. a segment such as
"1..100", or an operator such as "complement"). The position of nodes in
the tree describes the relationship between these segments and operators.
For each Feature we request its Location object. From the Location object
we get a list of LocationNodes. This LocationNodeList is composed of CORBA
union objects that represent the nodes and that we need to query to find
their type. Once we've identified an object's type, we can retrieve it
from the CORBA union object and then process it.
Thus, we define a Feature from which we can return a Location:
interface Feature {
...other stuff deleted...
Location getLocation();
};
This Location defines the following methods:
interface Location {
...other stuff deleted...
//Our method for "anal parsing".
string getLocationString();
//Get the nodes that describe the components of the Location
LocationNodeList getNodes();
};
LocationNodeList is a list of union objects defined in the Location
interface. Type codes are used to define the type of the object that
each union object stores.
For simplicity, I'm going to pretend our locations only ever contain
segments (e.g. 10..50, 70..<100) and operators (e.g. join, complement,
etc.) [In fact, there are two additional types]:
interface Location {
...other stuff deleted...
//Define sequence of union results.
typedef sequence<LocationNode_u> LocationNodeList;
typedef long LocationNodeTypeCode;
//Define type codes for virtual segment and operator nodes.
const long VirtualSegment_ltc = 1;
const long Operator_ltc = 4;
//Definition of the union.
union LocationNode_u switch (LocationNodeTypeCode) {
case VirtualSegment_ltc: LocVirtualSegment virtual;
case Operator_ltc: LocOperator operator;
};
};
OK - Let's assume I've called getNodes() on my Location and it's returned
a LocationNodeList.
I need to determine the type of the first object in the list (the root of
the tree). This is done by calling the discriminator() method on this
union object:
int discriminator = locationNodeList[0].discriminator();
I check the value of 'discriminator' against my type codes. E.g. if it
returns '1', then my life is easy and the first node is a VirtualSegment
which I can return from the union object:
if (discriminator == 1) {
LocVirtualSegment segment = locationNodeList[0].virtual();
//...continue processing segment...
}
In this case the Location is of the form 'x..y'. The struct for
LocVirtualSegment is given below and I can find the start and end of the
Location (by using the Fuzzy struct, we can handle fuzziness too).
//Define the LocVirtualSegment struct:
struct LocVirtualSegment {
//The start location
type::Fuzzy start;
//The end location.
type::Fuzzy end;
//Is it the complementary strand.
boolean complement;
};
};
In module 'type', this is the Fuzzy struct and type codes:
typedef long FuzzyTypeCode;
const long Exact_ftc = 1;
const long In_ftc = 2;
const long Between_ftc = 3;
const long Less_ftc = 4;
const long Greater_ftc = 5;
struct Fuzzy {
//The value
long value;
//The length of the fuzzy value
long size;
//A type code describing the Fuzzy operator.
FuzzyTypeCode type;
};
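As a rough illustration of how a client might turn a Fuzzy pair back into EMBL notation, here is a minimal Java sketch. The Fuzzy class is a hypothetical plain-Java stand-in for the IDL struct, not the class an IDL compiler would generate. Mapping Less_ftc to "<" follows the "80..<100" example in this message; mapping Greater_ftc to ">" is my assumption by analogy. In_ftc and Between_ftc are deliberately left unhandled because the exact meaning of 'size' is not described here, and the size value of 0 used in main is only a placeholder.

```java
final class FuzzySketch {
    // Values copied from the FuzzyTypeCode constants above.
    static final int EXACT_FTC = 1;
    static final int IN_FTC = 2;
    static final int BETWEEN_FTC = 3;
    static final int LESS_FTC = 4;
    static final int GREATER_FTC = 5;

    /** Hypothetical stand-in for the IDL struct type::Fuzzy (IDL long maps to Java int). */
    static final class Fuzzy {
        final int value;
        final int size;
        final int type;

        Fuzzy(int value, int size, int type) {
            this.value = value;
            this.size = size;
            this.type = type;
        }
    }

    /** Render one position; In/Between are flagged with '?' rather than guessed at. */
    static String render(Fuzzy f) {
        switch (f.type) {
            case EXACT_FTC:
                return Integer.toString(f.value);
            case LESS_FTC:
                return "<" + f.value;
            case GREATER_FTC:
                return ">" + f.value; // assumption, by analogy with Less_ftc
            default:
                return "?" + f.value; // In_ftc, Between_ftc: not covered by this sketch
        }
    }

    /** Render a start/end pair in the 'x..y' form. */
    static String renderSegment(Fuzzy start, Fuzzy end) {
        return render(start) + ".." + render(end);
    }

    public static void main(String[] args) {
        Fuzzy start = new Fuzzy(80, 0, EXACT_FTC); // size 0 is a placeholder
        Fuzzy end = new Fuzzy(100, 0, LESS_FTC);
        System.out.println(renderSegment(start, end)); // prints 80..<100
    }
}
```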
If my life is more complex, then the first LocationNode may be an
operator (discriminator value = 4).
I can return this object from the LocationNodeList union object:
if (discriminator == 4) {
LocOperator locOperator = locationNodeList[0].operator();
//...continue processing locOperator...
}
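The two checks above can be folded into a single switch on the discriminator. The sketch below is self-contained, so it uses hypothetical hand-written stand-ins (LocationNodeU, LocVirtualSegment, LocOperator) rather than the classes generated from the EMBL IDL; only the discriminator()/virtual()/operator() shape is taken from the snippets in this message. The segment stand-in is simplified to plain ints instead of Fuzzy structs.

```java
final class DispatchSketch {
    // Type codes copied from the IDL constants above.
    static final int VIRTUAL_SEGMENT_LTC = 1;
    static final int OPERATOR_LTC = 4;

    /** Simplified stand-in: plain ints instead of type::Fuzzy. */
    static final class LocVirtualSegment {
        final int start;
        final int end;

        LocVirtualSegment(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Stand-in for the LocOperator struct. */
    static final class LocOperator {
        final String operation;
        final int[] childIds;

        LocOperator(String operation, int... childIds) {
            this.operation = operation;
            this.childIds = childIds;
        }
    }

    /** Hypothetical stand-in for the union class LocationNode_u. */
    static final class LocationNodeU {
        private final int discriminator;
        private final Object value;

        private LocationNodeU(int discriminator, Object value) {
            this.discriminator = discriminator;
            this.value = value;
        }

        static LocationNodeU of(LocVirtualSegment s) {
            return new LocationNodeU(VIRTUAL_SEGMENT_LTC, s);
        }

        static LocationNodeU of(LocOperator o) {
            return new LocationNodeU(OPERATOR_LTC, o);
        }

        int discriminator() {
            return discriminator;
        }

        // Reading the wrong branch is an error; this stand-in throws IllegalStateException.
        LocVirtualSegment virtual() {
            if (discriminator != VIRTUAL_SEGMENT_LTC) throw new IllegalStateException("not a segment");
            return (LocVirtualSegment) value;
        }

        LocOperator operator() {
            if (discriminator != OPERATOR_LTC) throw new IllegalStateException("not an operator");
            return (LocOperator) value;
        }
    }

    /** Check the discriminator first, then pull out the matching branch. */
    static String describe(LocationNodeU node) {
        switch (node.discriminator()) {
            case VIRTUAL_SEGMENT_LTC: {
                LocVirtualSegment segment = node.virtual();
                return "segment " + segment.start + ".." + segment.end;
            }
            case OPERATOR_LTC: {
                LocOperator locOperator = node.operator();
                return "operator " + locOperator.operation;
            }
            default:
                // The two additional node types mentioned earlier would land here.
                return "unhandled type code " + node.discriminator();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(LocationNodeU.of(new LocVirtualSegment(1, 100))));
        System.out.println(describe(LocationNodeU.of(new LocOperator("join", 1, 2))));
    }
}
```

Using the named constants rather than the literals 1 and 4 keeps the client readable if more node types are handled later.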
The struct for this 'operator' object is given below:
//Define the LocOperator struct:
struct LocOperator {
//The operation - "join", "complement", etc.
string operation;
//The IDs of the nodes to which this operation applies.
idList childIds;
};
I can find the nature of the operator from the 'operation' string of its
struct.
I can find the LocationNodes upon which this operator acts from the
idList. The idList is an array of IDs that point to the positions of the
children in the LocationNodeList *after* the operator node.
e.g. childIds = 1,2,3 implies the next three nodes after this one in the
LocationNodeList are acted upon by this operator.
Thus the location "join(1..30, 80..<100)" would have a LocationNodeList of
length 3.
Node 0 = Operator node with operation = "join"; idList = {1,2};
Node 1 = VirtualSegment with start = 1; end = 30; fuzzy = Exact_ftc
Node 2 = VirtualSegment with start = 80; end = 100; fuzzy = Less_ftc (on the end)
The location "join(complement(1..30, 50..60), one-of(80..100,85..100))"
could be defined:
Node 0 = Operator "join"; idList = {1,4};
Node 1 = Operator "complement"; idList = {1,2};
Node 2 = VirtualSegment for 1..30
Node 3 = VirtualSegment for 50..60
Node 4 = Operator "one-of"; idList = {1,2}
Node 5 = VirtualSegment 80..100
Node 6 = VirtualSegment 85..100
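To make the tree walk concrete, here is a minimal recursive sketch in Java that rebuilds the location string from the seven-node list above. It rests on one assumption that should be checked against the real server: each childId is read as an offset from the operator's own position in the list (so "complement" at Node 1 with idList {1,2} points at Nodes 2 and 3). The Node class is a hypothetical flattened stand-in for the union objects, with segments stored as ready-made text.

```java
import java.util.ArrayList;
import java.util.List;

final class LocationTreeSketch {
    // Type codes copied from the IDL constants above.
    static final int VIRTUAL_SEGMENT_LTC = 1;
    static final int OPERATOR_LTC = 4;

    /** Hypothetical flattened stand-in for one LocationNode_u entry. */
    static final class Node {
        final int typeCode;
        final String text;    // segment text, or the operation name
        final int[] childIds; // empty for segments

        private Node(int typeCode, String text, int[] childIds) {
            this.typeCode = typeCode;
            this.text = text;
            this.childIds = childIds;
        }

        static Node segment(String text) {
            return new Node(VIRTUAL_SEGMENT_LTC, text, new int[0]);
        }

        static Node operator(String operation, int... childIds) {
            return new Node(OPERATOR_LTC, operation, childIds);
        }
    }

    /** Recursively render the subtree rooted at nodes.get(index). */
    static String render(List<Node> nodes, int index) {
        Node node = nodes.get(index);
        if (node.typeCode == VIRTUAL_SEGMENT_LTC) {
            return node.text; // leaf: nothing further to visit
        }
        StringBuilder sb = new StringBuilder(node.text).append('(');
        for (int i = 0; i < node.childIds.length; i++) {
            if (i > 0) sb.append(',');
            // Assumption: childIds are offsets from this operator's own index.
            sb.append(render(nodes, index + node.childIds[i]));
        }
        return sb.append(')').toString();
    }

    /** The seven-node example list from the text. */
    static List<Node> example() {
        List<Node> nodes = new ArrayList<>();
        nodes.add(Node.operator("join", 1, 4));       // Node 0
        nodes.add(Node.operator("complement", 1, 2)); // Node 1
        nodes.add(Node.segment("1..30"));             // Node 2
        nodes.add(Node.segment("50..60"));            // Node 3
        nodes.add(Node.operator("one-of", 1, 2));     // Node 4
        nodes.add(Node.segment("80..100"));           // Node 5
        nodes.add(Node.segment("85..100"));           // Node 6
        return nodes;
    }

    public static void main(String[] args) {
        // prints join(complement(1..30,50..60),one-of(80..100,85..100))
        System.out.println(render(example(), 0));
    }
}
```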
From the client's point of view, you need to be happy with the concept of
making a recursive call to traverse the tree completely. The alternative is
to have the BioCorba SeqFeature provide start() and end() returning the
extremes of the SeqFeature, getLocationString() returning the string
representation, and a 'LocationNodeList getLocation()' returning the full
description.
In most normal cases (cf. example 1 above), the LocationNodeList is a
List, and not a tree. The client could choose to ignore anything that
doesn't look straightforward (i.e. has more than one operator in the
list) and thus avoid having to write a recursive tree search.
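A client taking this shortcut only needs to count the operator nodes before deciding whether to bother. A minimal sketch, assuming the discriminator values have already been collected into an int array (the helper name isSimple is hypothetical):

```java
final class SimpleLocationSketch {
    // Type code copied from the IDL constant Operator_ltc.
    static final int OPERATOR_LTC = 4;

    /** True when the list holds at most one operator, so it can be read as a flat list. */
    static boolean isSimple(int[] typeCodes) {
        int operators = 0;
        for (int code : typeCodes) {
            if (code == OPERATOR_LTC) {
                operators++;
            }
        }
        return operators <= 1;
    }

    public static void main(String[] args) {
        // join(1..30, 80..<100): one operator followed by two segments.
        System.out.println(isSimple(new int[] {4, 1, 1}));
        // join(complement(...), one-of(...)): three operators, needs the tree walk.
        System.out.println(isSimple(new int[] {4, 4, 1, 1, 4, 1, 1}));
    }
}
```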
*phew!*
Alan.
--
============================================================
Alan J. Robinson, D.Phil. Tel:+44-(0)1223 494444
European Bioinformatics Institute Fax:+44-(0)1223 494468
EMBL Outstation - Hinxton Email: alan@ebi.ac.uk
Wellcome Trust Genome Campus
Hinxton, Cambridge
CB10 1SD, UK http://industry.ebi.ac.uk/~alan/
============================================================