[BioPython] bioperl idl

Andrew Dalke dalke@bioreason.com
Sun, 19 Sep 1999 21:56:46 -0600


[Ewan's IDL]

Let me start by asking a dumb question.  What does strand type (as in
+ or -) mean?  Or do you have a URL for me (a protein person by
training) to read?


You have a type() method to get the sequence type.  What about
picking another name?  Python defines a "type()" function, which
is not a reserved word and will never have a conflict with the
method name.

However, implementations might want to do:

class Seq:
  def __init__(self, seq, type):
    self._seq = seq
    self._type = type

  def seq(self):
    return self._seq

   ...

so there is a point where the local "type" hides the builtin
definition.  Again, language-wise it doesn't make a difference,
but it might be confusing.

An alternate name might be "seq_type".
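
To make the shadowing concrete, here is a small sketch.  The class
names and the "saw_builtin" attribute are made up for illustration and
are not part of Ewan's IDL.  Inside the first __init__ the argument
named "type" hides the builtin; with "seq_type" the builtin stays
reachable under its usual name.

```python
import builtins


class ShadowedSeq:
    def __init__(self, seq, type):
        self._seq = seq
        self._type = type
        # Within this method "type" is the argument, not the builtin.
        self.saw_builtin = type is builtins.type


class Seq:
    def __init__(self, seq, seq_type):
        self._seq = seq
        self._seq_type = seq_type
        # Nothing shadows the builtin here.
        self.saw_builtin = type is builtins.type

    def seq(self):
        return self._seq

    def seq_type(self):
        return self._seq_type
```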


You have a "Comment" string data type which is

>  // just a list of strings.
>  interface Comment : ReleaseableObject {
>    sequence <string> comments;
>    bool is_html;
>  }

This doesn't feel right, for a couple of reasons.  Why is it a list of
strings, rather than a single string separated by newlines?  Is it the
cross-platform newline issue?  I would rather have things as a single
string (with a defined "\n", "\r" or "\r\n") than lots of strings.

Also, the is_html isn't very nice.  What about support for other
formats?  This is exactly the problem that MIME solves, so you could
have

  interface Comment : ReleaseableObject {
    string mime_type;
    string comment;
  }

and let that be all. (For fun reading, the MIME media types are given
in RFC 2046, at http://www.cis.ohio-state.edu/htbin/rfc/rfc2046.html).
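
On the client side that interface might look roughly like the sketch
below.  It assumes "\n" is the agreed line separator; the lines() and
is_html() helpers are my own additions, not something in the IDL.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    mime_type: str  # e.g. "text/plain" or "text/html"
    comment: str    # the whole comment, lines separated by "\n"

    def lines(self):
        # Recover the list-of-strings view when a client wants it.
        return self.comment.split("\n")

    def is_html(self):
        # The old boolean becomes one special case of the media type.
        # Media type names are compared case-insensitively.
        return self.mime_type.lower() == "text/html"
```

So nothing is lost compared to the list-plus-flag version: both views
can be derived from the two strings.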


Of course, then there are issues of arbitration, as when the client
only understands HTML and text and the server has LaTeX, but that's a
problem RFC 2295 has already addressed in the HTTP world
(http://www.cis.ohio-state.edu/htbin/rfc/rfc2295.html).
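
A much cruder form of arbitration than RFC 2295 describes (this is
not that RFC's algorithm, only a first-match pick) could be sketched
as follows.  The function name and the "application/x-latex" label in
the usage note are assumptions for the example.

```python
def pick_variant(client_accepts, server_variants):
    """Return (mime_type, text) for the first type in the client's
    preference list that the server can supply, or None if the two
    sides have nothing in common."""
    for mime_type in client_accepts:
        if mime_type in server_variants:
            return mime_type, server_variants[mime_type]
    return None
```

A client that lists ["text/html", "text/plain"] talking to a server
holding {"application/x-latex": ..., "text/plain": ...} would get the
plain text version; a client that only lists "text/html" would get
None and have to give up or ask someone to translate.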

Or questions of who does the translation.  Take PDB comments, which
have a special notation for doing superscript and subscript.  Should
the server do the translation to HTML, or should it provide a
"text/pdb-comment" and let the client do the translation?  Old
problem.  No solution except to depend on conventions.

(Hmm.., or provide a translation service using CORBA ...)


More generally, in the IDL and in the bioperl Seq.pm update, you toss
around various identifiers, like database, primary_key, and accession.
I want to bring up something I looked into a couple of years ago that
I never had time to follow up on, URNs.

URL stands for Uniform Resource Locator (or "Universal" if you've
been around long enough :).  URN stands for Uniform Resource Name
(RFC 2141 at http://www.cis.ohio-state.edu/htbin/rfc/rfc2141.html).

The idea is that many different places serve essentially the same
object.  For example, you can get PDB files from the RCSB or from the
MSD or from your local repository, so the same name, like

  pdb:2plv

could be translated to http or ftp or even file URLs.  I recall also
discussion on how to handle content negotiation.  With a properly
configured setup (browser and URN resolver) you are supposed to be
able to be redirected to different sites if the primary URL is dead or
has been moved.

I was interested in the standard because there are many places that
provide the same information, and I wanted to have some way to say
that a SWISS-PROT identifier could be resolved to any one of *these*
places, and additionally customize the responses to point to the
fastest one.
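
A toy resolver along these lines might look like the sketch below.
The MIRRORS table, the resolve() name and all of the hosts are
placeholders I made up (example.org is not a real PDB mirror); the
only thing taken from above is the "pdb:2plv" style of name.  Putting
the fastest site first in the list is how a local setup would
customize the answer.

```python
# Hypothetical mirror table: namespace -> URL templates, in preference
# order.  The hosts are placeholders, not real services.
MIRRORS = {
    "pdb": [
        "file:///data/pdb/{id}.pdb",
        "http://mirror1.example.org/pdb/{id}",
        "ftp://mirror2.example.org/pub/pdb/{id}",
    ],
}


def resolve(name, mirrors=MIRRORS):
    """Translate a name such as "pdb:2plv" into candidate URLs."""
    namespace, sep, identifier = name.partition(":")
    if not sep or not identifier:
        raise ValueError("expected namespace:identifier, got %r" % name)
    templates = mirrors.get(namespace.lower())
    if templates is None:
        raise KeyError("no resolver known for namespace %r" % namespace)
    return [t.format(id=identifier) for t in templates]
```

A caller would try the returned URLs in order and fall through to the
next one if a site is dead or has moved.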

This isn't something new, BTW.  It's exactly analogous to the way NCBI
labels their sequences, as in sp|THIS|THAT.  But it is more
generalized with some thought behind it on use as a general naming
scheme.

URNs per se have not taken off like I hoped they would.  The problem
is in setting up distributed name resolvers.  There are a couple of
ways to do it, but doing it for the whole web would require a large
infrastructure, which is very, very difficult to set up and maintain.

But for a relatively limited domain, like bioinformatics, it becomes
much easier.

So what I got out of the URN documents was this: convert data types
(like database id references) into unambiguous names, use the URN
RFC's definition for the format of those names, and provide
facilities to resolve a name as needed.
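
For reference, RFC 2141 writes a URN as "urn:" <NID> ":" <NSS>, where
the NID (namespace identifier) is 1 to 32 letters, digits or hyphens,
does not start with a hyphen, and is case-insensitive.  A simplified
check is sketched below.  It accepts any non-empty namespace-specific
string rather than enforcing the RFC's character rules, and "pdb" is
used as a hypothetical NID, not a registered one.

```python
import re

# "urn:" and the NID are case-insensitive.  The NSS check is
# deliberately loose: any non-empty string without a newline.
_URN = re.compile(r"urn:([a-z0-9][a-z0-9-]{0,31}):(.+)", re.IGNORECASE)


def parse_urn(text):
    """Split "urn:<NID>:<NSS>" into (nid, nss), lowercasing the NID."""
    m = _URN.fullmatch(text)
    if m is None:
        raise ValueError("not a URN: %r" % text)
    nid, nss = m.groups()
    return nid.lower(), nss
```

Note that a bare "pdb:2plv" is rejected here; in strict URN form it
would be written "urn:pdb:2plv".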


How does this apply to the IDL?  Glad you asked :) Here are some of
the identifier types in the IDL:


>  interface Dbxref : ReleaseableObject {
>    string database(void);
>    string primary_key(void);
>  }

>  interface LiteratureReference : Dbxref {
>    string title;
>    string author_line;
>    string location;
>  };

in the Seq interface:
>    string id(void); // human readable name
>    string accession(void); // computer assigned name - see below

>  interface SeqFeature : Range , ReleaseableObject {
>    string primary_key(void);
>    string source_key(void);

>  interface SearchResult : ReleaseableObject {
>    string query_id(void);
>    string library_name(void);

At present these are all opaque strings.  I propose that these be
better defined, specifically to make them like URNs (also, I think I
heard some discussion in the OMG meeting about a similar OMG spec?)

Then you need but one resolver per class of data types, which, yes, is
exactly what's already present with things like

>   interface SeqDB : ReleaseableObject {
>    Seq get_Seq_by_id(in string id);
>    Seq get_Seq_by_acc(in string acc);
>  };

What I'm suggesting is purely a semantic *convention* built on top of
the existing interfaces.

Oh, BTW, in the comment for LiteratureReference you say:
> The primary key here(medline number) is considerably less
> informative

You can't limit yourself to medline, since some places have references
not indexed by medline, for example, internal documentation.  All it
needs to be is the name which can be resolved to get the right sort of
information.

I also believe that by using sufficiently well-formed names, some of
the interfaces can be pushed into one call, like Dbxref, where the
database and primary_key can be represented as one chunk instead of
two, so long as you always know how to split them apart.
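
The "always know how to split them apart" condition is easy to state
in code.  In this sketch (function names are mine, and ":" as the
separator is an assumption) the rule is that the database name may
not contain the separator, so splitting on the first ":" always gives
back the original pair even when the key itself contains one.

```python
def join_dbxref(database, primary_key):
    """Combine database and primary_key into one name."""
    if not database or ":" in database:
        raise ValueError("database must be non-empty and contain no ':'")
    return database + ":" + primary_key


def split_dbxref(name):
    """Inverse of join_dbxref: split on the first ':' only."""
    database, sep, primary_key = name.partition(":")
    if not sep or not database:
        raise ValueError("expected database:primary_key, got %r" % name)
    return database, primary_key
```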

However, as I've mentioned half-a-dozen times now, I don't have much
time to work on this 'cause I've got a Real Job [tm] which isn't
bioinformatics related.  :(

Still, I hope it's given people something to think about!

						Andrew
						dalke@bioreason.com