[BioPython] IDL proposal

Fri, 28 Jan 2000 18:22:28 +0000 (GMT)

This is a proposed IDL for collaboration between the three
projects and other projects. It is small, simple, I hope uncontroversial,
and implementable by freeware ORBs. Please take some time to read
it and/or pass it on to people with Object/CORBA/Java experience in the
groups around you.

Please post follow-ups to the bioperl-guts mailing list, and not all
three lists.

I want to say some things up front to answer some questions people might
have:

a) These IDLs are for software projects in bioinformatics. They *do not*
attempt to model biology in any way. They are just programming aids.

b) The scope of the IDLs is small: LAN-type networks where transport
costs are not high. Hence the use of interfaces. Do not expect them to
work well for internet-wide connections (though they might).

c) If you agree with some points in the IDL, please post your *agreement*.
It is really easy for people to focus on disagreements in discussion,
not the agreements.

d) Please look for consensus. If this IDL looks sensible, just say "it
looks good". Do comment on issues when they are there.

e) I was a co-author on the OMG IDL. This IDL is designed for different
problems with freeware technology - hence the different IDL. They should
play well together - it is not a replacement in any way for the OMG
standard.

Ok. Enough faff. Here is the specification... Enjoy

// Notes on the IDL. 
//
// This IDL is designed to be pretty uncontroversial and simple. I am
// using very simple CORBA features, so there are no valuetypes, anys
// or consts, all of which get implemented late in ORBs (if at all).
//
// This IDL is designed only for sequences and features.  There is no
// provision for other stuff - we need to start with the things we can
// agree with and build on that.
//
// This IDL should work well as an internal IDL for an OMG compliant
// server. The OMG specification leaves a lot of the "magic" to the
// server, including memory management etc, and also uses a lot of
// "standard" OMG types and services which generally do not come
// with a standard free ORB. Hence the GNOME memory management
// model and the simple iterators.
//
// <birney@ebi.ac.uk>
//

//
// This comes directly from the GNOME Bonobo model.
// It allows memory management via ref and unref calls.
// The query_interface is not important for this case,
// but is here for completeness.
//

module GNOME {
  interface Unknown {
    void ref();  // up the reference count on an object
    void unref();// down the reference count on an object
    Object query_interface(in string repoid);
  };
};

//
// These are the actual biological objects that we are interested
// in. Nearly everything is an interface. It is not going to work
// well across large internet connections, so don't use it for that.
//

module Bio  {

  exception RequestTooLarge { string reason; }; // means you need to request a smaller number,
                                                // ie, request only failed due to its size.
  exception OutOfRange { string reason; };      // For when start/end points are out of range.
  exception EndOfStream { };                    // for end of streams
  exception UnableToProcess { string reason; }; // All other errors

  enum SeqType { PROTEIN,DNA,RNA };

  // Primary sequences are just the sequence information plus enough identity
  // information to process the sequence/store results/etc
  interface PrimarySeq : GNOME::Unknown {
    SeqType type(); // server has to at least *guess* the type.
    long    length();

    // the entire sequence. Use max_request_length to find the max size allowed
    string  get_seq() raises (RequestTooLarge);

    // gets a sub sequence. the start,end are in biological coordinates, ie, 1-2 are
    // the first two bases
    string  get_subseq(in long start,in long end) raises (OutOfRange,RequestTooLarge);

    // Two different ids, which might be the same. The first, display_id, is what to
    // show to a human. The second, primary_id, is what the implementation decides
    // is the correct unique id for this sequence (in a lot of cases this will be the
    // accession number). We might need a separate accession number method.

    string  display_id(); // id to display 
    string  primary_id(); // id to use as a unique id for this sequence. in some cases it could be
                          // byte position for example (ie, this is the unique id used below)
    long    version();    // potential (unstable) version number for the sequence. Can be 0

    // This is to find the largest string that can be passed back
    long    max_request_length(); 
  };

  // Represents streaming through a single database, eg over a fasta file.
  // Don't forget to unref objects once you are done with them.
  interface PrimarySeqStream : GNOME::Unknown {
    PrimarySeq next_seq() raises (EndOfStream,UnableToProcess);
    boolean    at_end(); // returns 1 when next_seq will give EndOfStream
  };

  // Provides a database mainly for database searching. Can make new
  // streams and can retrieve sequences from the database.
  interface PrimarySeqDB : GNOME::Unknown {
    string  database_name();     // This is to identify databases by name
    short   database_version();  // version of the database
    PrimarySeqStream make_stream(); // makes a new stream object.
    PrimarySeq get_primaryseq(in string primary_id) raises (UnableToProcess); // Retrieves one sequence
  };

  // We need to be able to pass back additional structured information
  // in some cases. This gives us a way of doing it without specifying 
  // the structure at compile time. Try not to abuse this...

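  // This is equivalent to a hash of arrays in Perl
  typedef sequence <string> stringList;
  struct NameValueSet {
    string name;
    stringList values;
  };
  // The list type returned by SeqFeature::qualifiers
  typedef sequence <NameValueSet> NameValueSetList;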

  // SeqFeatures are features on a sequence. This is GFF
  // compatible. 

  interface SeqFeature : GNOME::Unknown {
    string type();           // exon, repeat etc.
    string source();         // source of the SeqFeature mainly for GFF compatibility
    string seq_primary_id(); // This gives the primary sequence id this is linked to.
    long start();            // start in biological coordinates (1 is the first base)
    long end();              // end in biological coordinates (1-2 are the first two bases in a sequence)
    short strand();          // -1,0,1. -1 means reverse, 0 means either, 1 means forward. Irrelevant for proteins
    NameValueSetList qualifiers(); // additional structured information
    boolean    has_PrimarySeq();   // returns 1 if it does, 0 if not.
    PrimarySeq get_PrimarySeq() raises ( UnableToProcess ); // the Sequence may or may not be there.
                                                            // implementors are free to choose
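  };

  // The list type returned by Seq::all_features and Seq::features_region
  typedef sequence <SeqFeature> SeqFeatureList;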

  // We have to handle large numbers of features.
  interface SeqFeatureIterator : GNOME::Unknown {
    SeqFeature next_feature();
    boolean    at_end();
  };

  // Yes, we should inherit from SeqFeature for more complex things. Please
  // inherit from SeqFeature for your favourite feature extension!

  // This is one heavy object! This should really be a number of
  // coordinating objects underneath. Notice that the Seq object
  // both inherits from the PrimarySeq interface and also has-a
  // PrimarySeq interface. This is deliberate, so that clients can
  // discard a complete sequence with features by freeing it but
  // still hold on to the original primary sequence.

  // Otherwise servers will have extremely large objects for every
  // sequence in feature-rich databases (bad).

  interface Seq : PrimarySeq {
    SeqFeatureList     all_features() raises (RequestTooLarge);
    SeqFeatureIterator all_features_iterator();

    SeqFeatureList     features_region(in long start,in long end) 
                             raises (OutOfRange,UnableToProcess,RequestTooLarge);
    SeqFeatureIterator features_region_iterator(in long start,in long end) 
                             raises (OutOfRange,UnableToProcess);

    long               max_feature_request();

    // This is put here so that clients can ask servers just for the
    // sequence and then free the large, seqfeature containing sequence.
    // It prevents a sequence with features having to stay in memory for ever.
    PrimarySeq         get_PrimarySeq(); 
  };  

  interface SeqDB : PrimarySeqDB {
    Seq get_seq(in string primary_id) raises (UnableToProcess); // Retrieves one sequence
  };
};
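
To make the coordinate convention and the request-size limit concrete, here is
a rough sketch in plain Python. It does not use an ORB at all: MockPrimarySeq
and fetch_in_chunks are hypothetical names invented for this illustration, and
the two exception classes only stand in for Bio::OutOfRange and
Bio::RequestTooLarge. It shows get_subseq taking 1-based, inclusive start/end,
and a client using max_request_length to pull a long sequence in pieces.

```python
class OutOfRange(Exception):
    """Stands in for Bio::OutOfRange."""


class RequestTooLarge(Exception):
    """Stands in for Bio::RequestTooLarge."""


class MockPrimarySeq:
    """In-memory stand-in for a Bio::PrimarySeq server object (hypothetical)."""

    def __init__(self, residues, max_request):
        self._residues = residues
        self._max = max_request

    def length(self):
        return len(self._residues)

    def max_request_length(self):
        return self._max

    def get_subseq(self, start, end):
        # Biological coordinates: 1-based and inclusive at both ends.
        if start < 1 or end > len(self._residues) or start > end:
            raise OutOfRange("%d-%d is outside 1-%d" % (start, end, len(self._residues)))
        if end - start + 1 > self._max:
            raise RequestTooLarge("ask for at most %d residues" % self._max)
        # Convert to Python's 0-based, half-open slice.
        return self._residues[start - 1:end]


def fetch_in_chunks(pseq):
    """Client-side helper: rebuild the whole sequence without exceeding the limit."""
    chunk = pseq.max_request_length()
    parts = []
    start = 1
    while start <= pseq.length():
        end = min(start + chunk - 1, pseq.length())
        parts.append(pseq.get_subseq(start, end))
        start = end + 1
    return "".join(parts)
```

With a ten-base sequence and a limit of four, get_subseq(1, 2) gives the first
two bases and fetch_in_chunks makes three calls (1-4, 5-8, 9-10).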

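A second rough sketch, again in plain Python with no ORB, of how a client might
walk a PrimarySeqStream and unref each object when finished with it. MockUnknown,
MockSeq, MockStream and summarise are all hypothetical names made up for this
example. The starting reference count of 1 is my assumption about how objects
are handed to the client; the IDL itself does not say.

```python
class EndOfStream(Exception):
    """Stands in for Bio::EndOfStream."""


class MockUnknown:
    """Reference-counted base mimicking GNOME::Unknown (hypothetical mock)."""

    def __init__(self):
        # Assumption: an object is handed to the client holding one reference.
        self.refcount = 1
        self.destroyed = False

    def ref(self):
        self.refcount += 1

    def unref(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.destroyed = True  # a real server would release the object here


class MockSeq(MockUnknown):
    def __init__(self, display_id, residues):
        super().__init__()
        self._id = display_id
        self._residues = residues

    def display_id(self):
        return self._id

    def length(self):
        return len(self._residues)


class MockStream(MockUnknown):
    def __init__(self, records):
        super().__init__()
        self._records = list(records)
        self._pos = 0
        self.handed_out = []  # kept only so the example can inspect refcounts

    def at_end(self):
        return self._pos >= len(self._records)

    def next_seq(self):
        if self.at_end():
            raise EndOfStream()
        seq = MockSeq(*self._records[self._pos])
        self._pos += 1
        self.handed_out.append(seq)
        return seq


def summarise(stream):
    """Drain a stream into (display_id, length) pairs, unref'ing as we go."""
    results = []
    try:
        while not stream.at_end():
            seq = stream.next_seq()
            try:
                results.append((seq.display_id(), seq.length()))
            finally:
                seq.unref()  # done with this sequence
    finally:
        stream.unref()  # done with the stream itself
    return results
```

The try/finally blocks are there so that a failure part-way through does not
leave objects alive on the server.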
-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230
<birney@ebi.ac.uk>
-----------------------------------------------------------------