[BioPython] bioperl idl

Sun, 03 Oct 1999 22:59:06 -0600

Sorry all, I've been working on (gasp!) work stuff and haven't had much
time for biopython.org.  You'll be getting a slew of emails as I try
to catch up this evening.  There might also be repeats since I've forgotten
what I've said before.

Ewan:
> actually in bioperl I am using moltype. (molecular type).

I may have mentioned this before.  I would prefer seqtype to moltype, since
there are other types of molecules (water, carbohydrates, lipids) which
may be important.

> I think this is pretty good. I was thinking of something similar. It is
> just that Comments in most bioinformatics databases are best represented
> as a list of strings, each string a line. This is the natural way to
> process them. Is 
> 
> sequence <string> comments
> string mime_type
> 
> so bad?

Yes, because the client has to determine what to do to merge multiple lines.
Should they always be merged with a newline added?  Or without?  Some
file formats mandate the newline separator character (e.g., RFC 822 email)
while some do not, so the process of merging is related to the content-type,
and there must be agreement between the client and server.  So I think it
is better to put everything in one string, since then the parser for the
MIME type is the sole arbiter of what to do.
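
To illustrate why the merge rule depends on the content type, here is a
minimal Python sketch.  The function name merge_comment_lines and the
JOINERS table are hypothetical, made up for this example; the only point
is that a client holding a list of lines needs some per-type rule like
this, while a client holding a single string does not.

```python
# Hypothetical table: the separator a client would need to know in order
# to merge a list of comment lines for each content type.
JOINERS = {
    "text/plain": "\n",        # assumption: plain text joined with a newline
    "message/rfc822": "\r\n",  # RFC 822 messages use CRLF line endings
}


def merge_comment_lines(lines, mime_type):
    """Join comment lines using the rule for the given content type."""
    try:
        separator = JOINERS[mime_type]
    except KeyError:
        # Without an agreed rule the client can only guess.
        raise ValueError("no merge rule known for %r" % (mime_type,))
    return separator.join(lines)


if __name__ == "__main__":
    print(repr(merge_comment_lines(["first line", "second line"], "text/plain")))
```

If the server sends one string instead, this table and the agreement it
implies are not needed on the client side.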

But considering your case, where you have a set of lines from a data file,
there's also the problem where extra "2D" information is added to the text
because of the known representation in a given file format.  For example,
I've seen ASCII drawings of chemical structures placed in comment strings.
These are easier to understand when viewed as they were laid out in the
original format, but they cannot be understood at all unless the order
of the lines is preserved.

Still, it can be parsed and sent back as a string (perhaps not text/plain,
given the newline considerations?) in such a way as to get the same information,
so my reverse question is: is merging the set of strings
into one string so bad?

> I'm just not trying to be super clever. Perhaps one should have
> 
> DBxREF {
>   string URN;
>   string database;
>   string primary_key;
> };

Jeff Chang pointed out URIs in one of his responses.  They define
the generic syntactic description within which URLs and URNs work.  I
haven't thought about this much over the last two years, though I
read the RFC he pointed out.  My thought was to have a well defined
means to have the database + primary_key be expressed in the URI.
For example:
  genbank:/locus/dmu66884
  genbank:/accession/u66884
  prosite:/accession/PS00365

and above that have a restriction that "accession" or "primary_key"
or something is always the primary key, which could be an alias for
searching on the "ac" or whatever fields.
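
As a rough sketch of how a client might take such a reference apart, here
is some Python.  The name parse_dbxref, the returned tuple layout, and the
PRIMARY_KEY_FIELDS set are assumptions made for illustration; nothing here
is an agreed format.

```python
# Hypothetical set of field names that always mean "the primary key".
PRIMARY_KEY_FIELDS = {"accession", "primary_key"}


def parse_dbxref(uri):
    """Split 'database:/field/key' into (database, field, key)."""
    database, colon, path = uri.partition(":")
    if not colon or not database:
        raise ValueError("missing database name in %r" % (uri,))
    if not path.startswith("/"):
        raise ValueError("expected '/field/key' after the colon in %r" % (uri,))
    field, slash, key = path[1:].partition("/")
    if not slash or not field or not key:
        raise ValueError("expected both a field and a key in %r" % (uri,))
    return database, field, key


def is_primary_key_reference(uri):
    """True if the reference names the record by its primary key."""
    database, field, key = parse_dbxref(uri)
    return field in PRIMARY_KEY_FIELDS


if __name__ == "__main__":
    for ref in ("genbank:/locus/dmu66884",
                "genbank:/accession/u66884",
                "prosite:/accession/PS00365"):
        print(parse_dbxref(ref), is_primary_key_reference(ref))
```

The three fields of the DBxREF struct then reduce to one string, with the
database and primary key recoverable by a simple split.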

As it turns out, we needed a facility for this on our local systems,
where we have data on flat files or on a database.  We tried out
saying something like:
  repository:/path/to/data/
  oracle:/way/to/get/oracle/data

and it seems to work.
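
One way to picture this kind of dispatch is the Python sketch below.  It
is not our actual code: the handler functions are hypothetical stand-ins
that only report which backend would be used, and the names HANDLERS and
resolve are made up for the example.

```python
def open_flat_file(path):
    # Stand-in: a real handler would read from the flat-file repository.
    return ("flat-file", path)


def open_oracle(path):
    # Stand-in: a real handler would query the database.
    return ("oracle", path)


# Map the part before the colon to the code that knows how to fetch data.
HANDLERS = {
    "repository": open_flat_file,
    "oracle": open_oracle,
}


def resolve(uri):
    """Pick a handler from the scheme and pass it the rest of the URI."""
    scheme, colon, path = uri.partition(":")
    if not colon or scheme not in HANDLERS:
        raise ValueError("no handler for %r" % (uri,))
    return HANDLERS[scheme](path)


if __name__ == "__main__":
    print(resolve("repository:/path/to/data/"))
    print(resolve("oracle:/way/to/get/oracle/data"))
```

The calling code only ever sees the URI string, so data can move between
flat files and the database without the callers changing.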

Jeff said:
> I don't know if URI's will take off.  However, in XML, URI's are the
> preferred way to reference external information that's either
> inappropriate to reproduce or hard to encode (binary data).

I looked at that for a tiny little bit, but got confused quickly because
most of the pages I found were about how URIs are used for namespaces.  I
can see the relationship, but there's probably something deeper in how
they use it than I've figured out yet.

						Andrew
						dalke@acm.org