[BioPython] URNs, URIs, URLs, content negotiation and all that

Thu, 17 May 2001 17:46:51 -0600

Hello,

  For a few years now I've been interested in working
with URNs (RFC 2141).  I'm looking for pointers to available
Python software for dealing with them.  I can't find any
hints on Google or the Vaults, and I'm pretty sure there
haven't been any newsgroup posts over the last couple of years ...
nope, not in groups.google.

  I'm also interested in tools for content negotiation
(RFC 2295).

  Here's a description of what I'm doing - I say it here
to make sure I'm looking for the right things.

  I have a lot of bioinformatics records, which have
various identifiers:
  SWISS-PROT entry 100K_RAT
  SWISS-PROT accession Q62671
  PDB id 2PLV
   ...

  These can be on the local machine or available on a web site.
If on the local machine it might be retrieved with
  popen("zcat sprot39.tar.gz | head -12345 | tail -50")
  open("/local/databases/swiss/acc/Q62671")
or through the web with
  http://ca.expasy.org/cgi-bin/get-sprot-raw.pl?Q62671
or perhaps through an intranet server.

I would like to centralize requests for a document by using
a URN, so I might ask for one of
   urn:bio:swiss-prot/accession/Q62671
   urn:bio:swiss-prot/entry/100K_RAT
   urn:bio:pdb/id/2PLV
and get an object that can be used no matter the data source.
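
As a rough illustration of the URN layout above, the pieces could be
split apart like this.  It is only a sketch: BioURN and parse_bio_urn
are made-up names, and the database/id-type/identifier layout of the
NSS is just the convention used in my examples, not something
RFC 2141 defines.

```python
from typing import NamedTuple


class BioURN(NamedTuple):
    """Hypothetical container for the parts of a urn:bio:... name."""
    nid: str
    database: str
    id_type: str
    identifier: str


def parse_bio_urn(urn: str) -> BioURN:
    # A URN has the form urn:<NID>:<NSS>; split on the first two colons only.
    parts = urn.split(":", 2)
    if len(parts) != 3 or parts[0].lower() != "urn":
        raise ValueError(f"not a URN: {urn!r}")
    _, nid, nss = parts

    # Assumed NSS layout: <database>/<id type>/<identifier>
    fields = nss.split("/", 2)
    if len(fields) != 3 or not all(fields):
        raise ValueError(f"unexpected NSS layout: {nss!r}")
    database, id_type, identifier = fields

    # The "urn" prefix and the NID are case-insensitive; the NSS is kept as-is.
    return BioURN(nid.lower(), database, id_type, identifier)
```

With that, parse_bio_urn("urn:bio:pdb/id/2PLV") gives a tuple whose
database is "pdb", id_type is "id" and identifier is "2PLV".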

This requires some sort of:
  - standard object for dealing with different resources
      (files, urls, in-memory data structures, MySQL, ...)
  - registry system for resolving the URN (though in this
      case the NID would always be "bio", so the resolver works
      one level further down, in the NSS)
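
To show the kind of registry I have in mind, here is a minimal sketch.
Everything in it is an assumption made for illustration: the
URNRegistry class, the (database, id type) key, and the in-memory
resolver that stands in for a real file, URL or MySQL lookup.  The
"standard object" is simply a readable file-like object here.

```python
import io
from typing import Callable, Dict, IO, Tuple

# A resolver takes the identifier and returns a readable file-like object.
Resolver = Callable[[str], IO[str]]


class URNRegistry:
    """Hypothetical registry mapping (database, id type) to a resolver."""

    def __init__(self) -> None:
        self._resolvers: Dict[Tuple[str, str], Resolver] = {}

    def register(self, database: str, id_type: str, resolver: Resolver) -> None:
        self._resolvers[(database, id_type)] = resolver

    def open(self, urn: str) -> IO[str]:
        # urn:<NID>:<NSS>, with the NID fixed to "bio" in this scheme.
        scheme, nid, nss = urn.split(":", 2)
        if scheme.lower() != "urn" or nid.lower() != "bio":
            raise ValueError(f"not a urn:bio name: {urn!r}")
        # Resolution happens one level down, inside the NSS.
        database, id_type, identifier = nss.split("/", 2)
        try:
            resolver = self._resolvers[(database, id_type)]
        except KeyError:
            raise LookupError(f"no resolver for {database}/{id_type}") from None
        return resolver(identifier)


# Placeholder data standing in for a local file or a web fetch.
_FAKE_RECORDS = {"Q62671": "placeholder record text for Q62671\n"}


def open_in_memory(identifier: str) -> IO[str]:
    return io.StringIO(_FAKE_RECORDS[identifier])


registry = URNRegistry()
registry.register("swiss-prot", "accession", open_in_memory)
```

The caller would then write
registry.open("urn:bio:swiss-prot/accession/Q62671").read() without
knowing where the record actually came from.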

I'm also interested in work done on content negotiation.
As described above, only a single document is returned.
In truth, people may want the document converted to a
different format, like:
   text/plain -- for the original record
   text/html -- for conversion to HTML
   bio/swiss-prot -- for (the null) conversion to SWISS-PROT
        format
   bio/fasta -- for conversion to FASTA (another format)
   bio/fasta; header=NCBI -- for conversion to FASTA using
         NCBI's header convention
   ...

What libraries are there for doing content negotiation on
the server side?
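
For what it's worth, the format selection itself could be as simple as
a table of converters keyed by media type.  The sketch below is not
RFC 2295 negotiation, just the dispatch step; the record dictionary,
the converter names, the placeholder sequence, and the assumption that
the NCBI header looks like "sp|accession|entry" are all mine.

```python
from typing import Callable, Dict, Tuple


def parse_media_type(value: str) -> Tuple[str, Dict[str, str]]:
    """Split 'bio/fasta; header=NCBI' into ('bio/fasta', {'header': 'NCBI'})."""
    main, *params = [part.strip() for part in value.split(";")]
    options: Dict[str, str] = {}
    for param in params:
        key, _, val = param.partition("=")
        options[key.strip().lower()] = val.strip()
    return main.lower(), options


def to_original(record: dict, **options: str) -> str:
    # The null conversion: hand back the record text unchanged.
    return record["raw"]


def to_fasta(record: dict, header: str = "plain", **options: str) -> str:
    if header.upper() == "NCBI":
        # Assumed NCBI-style header for SWISS-PROT records.
        title = f"sp|{record['accession']}|{record['entry']}"
    else:
        title = record["accession"]
    return f">{title}\n{record['sequence']}\n"


# Hypothetical converter table; text/html is left out of this sketch.
CONVERTERS: Dict[str, Callable[..., str]] = {
    "text/plain": to_original,
    "bio/swiss-prot": to_original,
    "bio/fasta": to_fasta,
}


def convert(record: dict, media_type: str) -> str:
    name, options = parse_media_type(media_type)
    try:
        converter = CONVERTERS[name]
    except KeyError:
        raise LookupError(f"cannot produce {name}") from None
    return converter(record, **options)


# Made-up record; the sequence is a placeholder, not real data.
record = {
    "accession": "Q62671",
    "entry": "100K_RAT",
    "sequence": "ACDEFGHIK",
    "raw": "placeholder SWISS-PROT text\n",
}
```

Here convert(record, "bio/fasta; header=NCBI") produces a FASTA record
whose first line is ">sp|Q62671|100K_RAT", while plain "bio/fasta"
uses just the accession.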

Finally, what's the standard way to access different versions
of the same record?  For example, the database release has a
version, and sometimes the record itself carries one.  I can
treat the version as opaque - it's usually a number or a date.

I think I start from the URN for the versionless record, which is
mapped to the most recent (or most appropriate) URL.  Part of
the returned data contains metadata giving the versioned URN.
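
A small sketch of that idea, under assumptions I'm making up for the
example: versions are opaque strings stored in the order they were
added, "most recent" means the last one added, the URLs are
placeholders, and the ";version=" suffix is only an invented way of
writing a versioned URN.

```python
from typing import Dict, List, NamedTuple, Tuple


class Resolved(NamedTuple):
    url: str            # where to fetch the record
    versioned_urn: str  # metadata naming the exact version returned


class VersionTable:
    """Hypothetical map from a versionless URN to its known versions."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Tuple[str, str]]] = {}

    def add(self, urn: str, version: str, url: str) -> None:
        # Versions are opaque; only insertion order is used.
        self._versions.setdefault(urn, []).append((version, url))

    def resolve(self, urn: str) -> Resolved:
        # The versionless URN maps to the most recently added version.
        version, url = self._versions[urn][-1]
        return Resolved(url, f"{urn};version={version}")


table = VersionTable()
urn = "urn:bio:swiss-prot/accession/Q62671"
table.add(urn, "39", "http://example.org/sprot39/Q62671")  # placeholder URL
table.add(urn, "40", "http://example.org/sprot40/Q62671")  # placeholder URL
```

Resolving the versionless URN then returns the URL for version "40"
together with the versioned name, so the caller can cite exactly what
it received.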

Comments?  Pointers?  Ideas?  Code?

Thanks!
                    Andrew Dalke
                    dalke@acm.org