[BioPython] IDL proposal
Ewan Birney
birney@ebi.ac.uk
Fri, 28 Jan 2000 18:22:28 +0000 (GMT)
This is a proposed IDL for collaboration between the three
projects and other projects. It is small, simple, I hope uncontroversial
and implementable by free-ware ORBs. Please take some time to read
it and/or pass to people with Object/CORBA/Java experience in the
groups around you.
Please post follow-ups to the bioperl-guts mailing list, and not all
three lists.
I want to say some things up front to answer some questions people might
have:
a) These IDLs are for software projects in bioinformatics. They *do not*
attempt to model biology in any way. They are just programming aids.
b) The scope of the IDLs is small, LAN-wide type networks where transport
costs are not high. Hence the use of interfaces. Do not expect them to
work well for internet wide connections (though they might).
c) If you agree with some points in the IDL, please post your *agreement*.
It is really easy for people to focus on disagreements in discussion,
not the agreements.
d) Please look for consensus. If this IDL looks sensible, just say "it
looks good". Do comment on issues where they exist.
e) I was a co-author on the OMG IDL. This IDL is designed for different
problems with freeware technology - hence the different IDL. They should
play well together - it is not a replacement in any way for the OMG
standard.
Ok. Enough faff. Here is the specification... Enjoy
// Notes on the IDL.
//
// This IDL is designed to be pretty uncontroversial and simple. I am
// using very simple CORBA features, so no valuetypes, any's or
// const's, all of which get implemented late in ORBs (if at all).
//
// This IDL is designed only for sequences and features. There is no
// provision for other stuff - we need to start with the things we can
// agree with and build on that.
//
// This IDL should work well as an internal IDL for an OMG compliant
// server. The OMG specification leaves a lot of the "magic" to the
// server, including memory management etc, and also uses a lot of
// "standard" OMG types and services which generally do not come
// with a standard free ORB. Hence the GNOME memory management
// model and the simple iterators.
//
// <birney@ebi.ac.uk>
//
//
// This comes directly from the GNOME Bonobo model.
// It allows memory management via ref and unref calls.
// The query_interface is not important for this case,
// but here for completeness.
//
module GNOME {
interface Unknown {
void ref(); // up the reference count on an object
void unref();// down the reference count on an object
Object query_interface(in string repoid);
};
};
//
// These are the actual biological objects that we are interested
// in. Nearly everything is an interface. It is not going to work
// well across large internet connections, so don't use it for that.
//
module Bio {
exception RequestTooLarge { string reason; }; // means you need to request a smaller number,
// ie, request only failed due to its size.
exception OutOfRange { string reason; }; // For when start/end points are out of range.
exception EndOfStream { }; // for end of streams
exception UnableToProcess { string reason; }; // All other errors
enum SeqType { PROTEIN,DNA,RNA };
// Primary sequences are just the sequence information plus enough identity information
// to process the sequence/store results/etc
interface PrimarySeq : GNOME::Unknown {
SeqType type(); // server has to at least *guess* the type.
long length();
// the entire sequence. Use max_request_length to find the max size allowed
string get_seq() raises (RequestTooLarge);
// gets a sub sequence. the start,end are in biological coordinates, ie, 1-2 are
// the first two bases
string get_subseq(in long start,in long end) raises (OutOfRange,RequestTooLarge);
// two different id's which might be the same. The first, display id is what to
// use if a human uses it. The second, primary_id is what the implementation decides
// as the correct unique id for this sequence. (in a lot of cases this will be accession
// number). We might need a separate accession number method
string display_id(); // id to display
string primary_id(); // id to use as a unique id for this sequence. in some cases it could be
// byte position for example (ie, this is the unique id used below)
long version(); // potential (unstable) version number for the sequence. Can be 0
// This is to find the largest string that can be passed back
long max_request_length();
};
// Represents streaming through a single database, eg over a fasta file
// Don't forget to unref objects once you are done with them
interface PrimarySeqStream : GNOME::Unknown {
PrimarySeq next_seq() raises (EndOfStream,UnableToProcess);
boolean at_end(); // returns true when next_seq would raise EndOfStream
};
// Provides a database mainly for database searching. Can make new
// streams and can retrieve sequences from the database.
interface PrimarySeqDB : GNOME::Unknown {
string database_name(); // This is to identify databases by name
short database_version(); // version of the database
PrimarySeqStream make_stream(); // makes a new stream object.
PrimarySeq get_primaryseq(in string primary_id) raises (UnableToProcess); // Retrieves one sequence
};
// We need to be able to pass back additional structured information
// in some cases. This gives us a way of doing it without specifying
// the structure at compile time. Try not to abuse this...
// This is equivalent to a hash of arrays in perl
typedef sequence <string> stringList;
struct NameValueSet {
string name;
stringList values;
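};
// The list type used by SeqFeature::qualifiers() below.
typedef sequence <NameValueSet> NameValueSetList;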
// SeqFeatures are features on a sequence. This is GFF
// compatible.
interface SeqFeature : GNOME::Unknown {
string type(); // exon, repeat etc.
string source(); // source of the SeqFeature mainly for GFF compatibility
string seq_primary_id(); // This gives the primary sequence id this is linked to.
long start(); // start in biological coordinates (1 is the first base)
long end(); // end in biological coordinates (1-2 are the first two bases in a sequence)
short strand(); // -1,0,1. -1 means reverse, 0 means either, 1 means forward. Irrelevant for proteins
NameValueSetList qualifiers(); // additional structured information
boolean has_PrimarySeq(); // returns 1 if it does, 0 if not.
PrimarySeq get_PrimarySeq() raises ( UnableToProcess ); // the Sequence may or may not be there.
// implementors are free to choose
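};
// The list type returned by the Seq feature methods below.
typedef sequence <SeqFeature> SeqFeatureList;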
// We have to handle large numbers of features.
interface SeqFeatureIterator : GNOME::Unknown {
SeqFeature next_feature();
boolean at_end();
};
// Yes, we should inherit from SeqFeature for more complex things. Please
// inherit from SeqFeature for your favourite feature extension!
// This is one heavy object! This should really be a number of
// coordinating objects underneath. Notice that the Seq object
// both inherits from the PrimarySeq interface and also has-a
// PrimarySeq interface. This is deliberate so that clients can
// indicate when they really want to discard a complete sequence
// with features by freeing but still hold on to the original
// primary sequence.
// Otherwise servers will have extremely large objects for every
// sequence in feature rich databases (bad).
interface Seq : PrimarySeq {
SeqFeatureList all_features() raises (RequestTooLarge);
SeqFeatureIterator all_features_iterator();
SeqFeatureList features_region(in long start,in long end)
raises (OutOfRange,UnableToProcess,RequestTooLarge);
SeqFeatureIterator features_region_iterator(in long start,in long end)
raises (OutOfRange,UnableToProcess);
long max_feature_request();
// This is put here so that clients can ask servers just for the
// sequence and then free the large, seqfeature containing sequence.
// It prevents a sequence with features having to stay in memory for ever.
PrimarySeq get_PrimarySeq();
};
interface SeqDB : PrimarySeqDB {
Seq get_seq(in string primary_id) raises (UnableToProcess); // Retrieves one sequence
};
};
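
To make the client-side conventions above concrete, here is a rough Python
sketch. It is not part of the proposal and uses no ORB: MockPrimarySeq,
MockPrimarySeqStream, fetch_whole and drain are hypothetical in-memory
stand-ins that only mimic the PrimarySeq and PrimarySeqStream method names.
It assumes biological coordinates map onto Python slices as
seq[start - 1:end], that a client reads long sequences in pieces no larger
than max_request_length(), and that it calls unref() on each object it is
done with.

```python
# Hypothetical in-memory stand-ins for Bio::PrimarySeq and
# Bio::PrimarySeqStream. No ORB is involved; all names are illustrative.

class OutOfRange(Exception):
    pass


class RequestTooLarge(Exception):
    pass


class EndOfStream(Exception):
    pass


class MockPrimarySeq:
    def __init__(self, display_id, seq, max_request=10):
        self._id = display_id
        self._seq = seq
        self._max = max_request
        self.refcount = 1  # assumption: the caller receives one reference

    def ref(self):
        self.refcount += 1

    def unref(self):
        self.refcount -= 1

    def display_id(self):
        return self._id

    def length(self):
        return len(self._seq)

    def max_request_length(self):
        return self._max

    def get_subseq(self, start, end):
        # Biological coordinates: 1-based and inclusive at both ends.
        if start < 1 or end > len(self._seq) or start > end:
            raise OutOfRange(f"{start}-{end} outside 1-{len(self._seq)}")
        if end - start + 1 > self._max:
            raise RequestTooLarge(f"at most {self._max} residues per call")
        return self._seq[start - 1:end]


class MockPrimarySeqStream:
    def __init__(self, seqs):
        self._seqs = list(seqs)
        self._pos = 0

    def at_end(self):
        return self._pos >= len(self._seqs)

    def next_seq(self):
        if self.at_end():
            raise EndOfStream()
        seq = self._seqs[self._pos]
        self._pos += 1
        return seq


def fetch_whole(seq):
    """Fetch a full sequence in pieces no larger than max_request_length()."""
    step = seq.max_request_length()
    parts = []
    for start in range(1, seq.length() + 1, step):
        end = min(start + step - 1, seq.length())
        parts.append(seq.get_subseq(start, end))
    return "".join(parts)


def drain(stream):
    """Read every sequence from a stream, releasing each one when done."""
    result = {}
    while not stream.at_end():
        seq = stream.next_seq()
        try:
            result[seq.display_id()] = fetch_whole(seq)
        finally:
            seq.unref()  # the client no longer needs this object
    return result
```

A real client would make the same calls on ORB-generated stubs, and each
call would be a network round trip, which is why the chunk size is taken
from max_request_length() rather than guessed.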
-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230
<birney@ebi.ac.uk>
-----------------------------------------------------------------