[BioPython] IDL semi-finalised

Fri, 11 Feb 2000 12:34:22 +0000

With help from Matt (from biojava) and Kim Rutherford (from Artemis) we
have a semi-finalised IDL. The IDL is more Java friendly and provides a real
"just-the-sequence" object, called AnonymousSeq, designed so that people
who want to declare methods that *just work on sequences* without any
other information can get hold of them.
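
As a rough sketch of what such a "just-the-sequence" method could look like
on the client side, here is a Python function that needs nothing beyond the
AnonymousSeq operations length() and get_seq(). The names AnonymousSeqLike,
InMemorySeq and gc_fraction are hypothetical and invented for illustration;
a real client would call a CORBA stub generated from the IDL rather than an
in-memory class.

```python
from typing import Protocol


class AnonymousSeqLike(Protocol):
    """Client-side view of the AnonymousSeq operations used below."""

    def length(self) -> int: ...
    def get_seq(self) -> str: ...


class InMemorySeq:
    """Hypothetical stand-in for a remote AnonymousSeq (no names, no ids)."""

    def __init__(self, residues: str) -> None:
        self._residues = residues

    def length(self) -> int:
        return len(self._residues)

    def get_seq(self) -> str:
        return self._residues


def gc_fraction(seq: AnonymousSeqLike) -> float:
    """Fraction of G/C bases, using only the sequence string itself."""
    if seq.length() == 0:
        return 0.0
    residues = seq.get_seq().upper()
    return sum(base in "GC" for base in residues) / seq.length()
```

Because gc_fraction only asks for the sequence, it would accept a PrimarySeq
or a Seq equally well, since both inherit from AnonymousSeq.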

[There were a lot of other interesting discussions about iterators and
databases which I won't bore you with.]

I have updated the bioperl-corba-server distribution to work with this
IDL, so there is one working server people can download.

I would like the three different projects to generate a number of clients
and servers to this IDL so that we can really start throwing objects
around between them. Once this has happened then we can re-evaluate the
IDL on this basis and provide a final, frozen, IDL.

I would also be interested then in recruiting a team to work on providing
a stable, Java-based bridge between this IDL and the OMG-BSA IDL. I think
this would be a great project for an infrastructure company that wanted to
show that it was serious about supporting free software. (That's a heavy
hint to some people on this list.)

IDL below.

// Notes on the IDL. 
//
// This IDL is designed to be pretty uncontroversial and simple. I am
// using very simple CORBA features, and so no valuetypes, anys or
// consts, all of which get implemented late in ORBs (if at all).
//
// This IDL is designed only for sequences and features.  There is no
// provision for other stuff - we need to start with the things we can
// agree with and build on that.
//
// This IDL should work well as an internal IDL for an OMG-compliant
// server. The OMG specification leaves a lot of the "magic" to the
// server, including memory management etc, and also uses a lot of
// "standard" OMG types and services which generally do not come
// with a standard free ORB. Hence the GNOME memory management
// model and the simple iterators.
//
// <birney@ebi.ac.uk>
//

//
// This comes directly from the GNOME Bonobo model.
// It allows memory management via ref and unref calls.
// The query_interface is not important for this case,
// but here for completeness.
//

module GNOME {
  interface Unknown {
    void ref();
    void unref();
    Object query_interface(in string repoid);
  };
};

//
// These are the actual biological objects that we are interested
// in. Nearly everything is an interface. It is not going to work
// well across large internet connections, so don't use it for that.
//

// The org.Biocorba.Seqcore package is so we look good in Java.
// Makes the C interface names waaaaay too long of course

module org {
  module Biocorba  {
    module Seqcore 
      // changed indentation to give us more space for the main text
{

  exception RequestTooLarge {
    string reason;
    long   suggested_size;
  };
  // means you need to request a smaller number,
  // ie, the request only failed due to its size.

  exception OutOfRange { string reason; };      // For when start/end points are out of range.
  exception EndOfStream { };                    // for end of streams
  exception UnableToProcess { string reason; }; // All other errors

  enum SeqType { PROTEIN,DNA,RNA };

  // AnonymousSeq is just the sequence information and *nothing else*
  // including names

  interface AnonymousSeq : GNOME::Unknown {
    SeqType type(); // server has to at least *guess* the type.
    long    length();

    // the entire sequence. Use max_request_length to find the max
    // size allowed
    string  get_seq() raises (RequestTooLarge);

    // gets a sub sequence. the start,end are in biological
    // coordinates, ie, 1-2 are the first two bases
    string  get_subseq(in long start,in long end)
                             raises (OutOfRange,RequestTooLarge);
    // This is to find the largest string that can be passed back
    long    max_request_length(); 
  };

  // Primary sequences are just the sequence information and enough
  // identity information to process the sequence/store results/etc.

  interface PrimarySeq : AnonymousSeq {

    // Three different ids which might be the same. The first,
    // display_id, is what to use if a human uses it. The second,
    // primary_id, is what the implementation decides is the correct
    // unique id for this sequence (in a lot of cases this will be the
    // accession number). The final one is the accession number, which
    // is the unique id in the biological database which it is from
    // (this may be the same as the primary_id, but might not). Yes -
    // we do need all three ids.

    string  display_id();       // id to display to humans
    string  primary_id();       // id to use as a unique id for this
                                // sequence. In some cases it could be
                                // byte position/file munged into a
                                // string, for example.
    string  accession_number(); // The unique id (commonly called accession
                                // number) in the biological database this
                                // comes from, not the particular instance
                                // of the database for the implementation.

    long    version();          // potential (unstable) version number for
                                // the sequence. 0 for things that don't
                                // have a version number.

  };

  // Represents streaming through a single database, eg over a fasta file
  // Don't forget to unref objects once you are done with them
  interface PrimarySeqIterator : GNOME::Unknown {
    PrimarySeq next() raises (EndOfStream,UnableToProcess);
    boolean    has_more(); // returns 1 when next() will give an object
  };

  // Provides a database mainly for database searching. Can make new
  // streams and can retrieve sequences from the database.
  interface PrimarySeqDB : GNOME::Unknown {
    string  database_name();     // This is to identify databases by name
    short   database_version();  // version of the database
    PrimarySeqIterator make_PrimarySeqIterator(); // makes a new iterator object.
    PrimarySeq get_PrimarySeq(in string primary_id) raises (UnableToProcess); // Retrieves one sequence
  };

  // We need to be able to pass back additional structured information
  // in some cases. This gives us a way of doing it without specifying 
  // the structure at compile time. Try not to abuse this...

  // This is equivalent to a hash of arrays in perl
  typedef sequence <string> stringList;
  struct NameValueSet {
    string name;
    stringList values;
  };

  typedef sequence <NameValueSet> NameValueSetList;

  // SeqFeatures are features on a sequence. This is GFF
  // compatible. 

  interface SeqFeature : GNOME::Unknown {
    string type();           // exon, repeat etc.
    string source();         // source of the SeqFeature mainly for GFF compatibility
    string seq_primary_id(); // This gives the primary sequence id this is linked to.
    long start();            // start in biological coordinates (1 is the first base)
    long end();              // end in biological coordinates (1-2 are the first two bases in a sequence)
    short strand();          // -1,0,1. -1 means reverse, 0 means either, 1 means forward. Irrelevant for proteins
    NameValueSetList qualifiers(); // additional structured information
    boolean    PrimarySeq_is_available();   // returns 1 if the PrimarySeq is available, 0 if not.
    PrimarySeq get_PrimarySeq() raises ( UnableToProcess ); // the Sequence may or may not be there.
                                                            // implementors are free to choose
  };

  typedef sequence <SeqFeature> SeqFeatureList;

  // We have to handle large numbers of features.
  interface SeqFeatureIterator : GNOME::Unknown {
    SeqFeature next() raises (EndOfStream,UnableToProcess);
    boolean    has_more();
  };

  // Yes we should inherit from SeqFeature for more complex things. Please
  // inherit from SeqFeature for your favourite feature extension!

  // This is one heavy object. This should really be a number of
  // coordinating objects underneath. Notice that the Seq object
  // both inherits from the PrimarySeq interface and also has-a
  // PrimarySeq interface. This is deliberate so that clients can
  // indicate when they really want to discard a complete sequence
  // with features by freeing it, but still hold on to the original
  // primary sequence.

  // Otherwise servers will have extremely large objects for every
  // sequence in feature rich databases (bad).

  interface Seq : PrimarySeq {
    SeqFeatureList     all_features() raises (RequestTooLarge);
    SeqFeatureIterator all_features_iterator();

    SeqFeatureList     features_region(in long start,in long end) 
                             raises (OutOfRange,UnableToProcess,RequestTooLarge);
    SeqFeatureIterator features_region_iterator(in long start,in long end) 
                             raises (OutOfRange,UnableToProcess);

    long               max_feature_request();

    // This is put here so that clients can ask servers just for the
    // sequence and then free the large, seqfeature containing sequence.
    // It prevents a sequence with features having to stay in memory for ever.
    PrimarySeq         get_PrimarySeq(); 
  };  

  typedef sequence <string> primaryidList;

  interface SeqDB : PrimarySeqDB {
    Seq get_Seq(in string primary_id) raises (UnableToProcess); // Retrieves one sequence
    primaryidList get_primaryidList();
  };

}; // end Seqcore module
  }; // end Biocorba module
}; // end org module
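
To make the biological coordinates and max_request_length() a bit more
concrete, here is a minimal Python sketch of a client pulling a long sequence
in chunks via get_subseq(). MockAnonymousSeq and fetch_whole_sequence are
hypothetical names made up for this example, and the mock runs in-process; it
only mimics the behaviour the IDL comments describe (1-based, inclusive
coordinates, and RequestTooLarge carrying a suggested_size). It assumes the
server reports a max_request_length() of at least 1.

```python
class OutOfRange(Exception):
    """Mirrors the IDL OutOfRange exception."""


class RequestTooLarge(Exception):
    """Mirrors the IDL RequestTooLarge exception."""

    def __init__(self, reason: str, suggested_size: int) -> None:
        super().__init__(reason)
        self.suggested_size = suggested_size


class MockAnonymousSeq:
    """Hypothetical in-process stand-in for a remote AnonymousSeq."""

    def __init__(self, residues: str, max_len: int) -> None:
        self._residues = residues
        self._max_len = max_len

    def length(self) -> int:
        return len(self._residues)

    def max_request_length(self) -> int:
        return self._max_len

    def get_subseq(self, start: int, end: int) -> str:
        # Biological coordinates: 1-based, both ends inclusive.
        if start < 1 or end > len(self._residues) or start > end:
            raise OutOfRange(f"{start}-{end} is outside 1-{len(self._residues)}")
        if end - start + 1 > self._max_len:
            raise RequestTooLarge("too many residues requested", self._max_len)
        # 1-based inclusive start..end maps to the Python slice [start-1:end].
        return self._residues[start - 1:end]


def fetch_whole_sequence(seq: MockAnonymousSeq) -> str:
    """Assemble the full sequence from chunks of at most max_request_length."""
    chunk = seq.max_request_length()
    total = seq.length()
    pieces = []
    start = 1
    while start <= total:
        end = min(start + chunk - 1, total)
        pieces.append(seq.get_subseq(start, end))
        start = end + 1
    return "".join(pieces)
```

With this mock, get_subseq(1, 2) gives the first two bases, matching the
"1-2 are the first two bases" convention in the IDL comments.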

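The iterator interfaces and the ref/unref memory model are meant to be used
together, so here is a small Python sketch of a client draining a
PrimarySeqIterator and releasing each object as it goes. The Mock* classes
and collect_display_ids are hypothetical, and the sketch assumes that every
object handed back to the client carries one reference that the client must
release with unref(); the IDL itself does not spell that rule out.

```python
class EndOfStream(Exception):
    """Mirrors the IDL EndOfStream exception."""


class MockPrimarySeq:
    """Hypothetical stand-in for a remote PrimarySeq with a reference count."""

    def __init__(self, display_id: str) -> None:
        self._display_id = display_id
        self.refcount = 1  # assumption: the caller owns one reference

    def display_id(self) -> str:
        return self._display_id

    def unref(self) -> None:
        self.refcount -= 1


class MockPrimarySeqIterator:
    """Hypothetical stand-in for a remote PrimarySeqIterator."""

    def __init__(self, seqs) -> None:
        self._pending = list(seqs)
        self.refcount = 1  # assumption: the caller owns one reference

    def has_more(self) -> bool:
        return bool(self._pending)

    def next(self) -> MockPrimarySeq:
        if not self._pending:
            raise EndOfStream()
        return self._pending.pop(0)

    def unref(self) -> None:
        self.refcount -= 1


def collect_display_ids(iterator: MockPrimarySeqIterator) -> list:
    """Drain the iterator, releasing every object as soon as it is used."""
    ids = []
    try:
        while iterator.has_more():
            seq = iterator.next()
            try:
                ids.append(seq.display_id())
            finally:
                seq.unref()  # release each sequence once we are done with it
    finally:
        iterator.unref()  # the iterator itself is reference counted too
    return ids
```

The try/finally blocks are there so that references are still released if a
call fails part way through the stream.
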
-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230
<birney@ebi.ac.uk>
-----------------------------------------------------------------