[BioPython] URNs, URIs, URLs, content negotiation and all that
Andrew Dalke
dalke@acm.org
Thu, 17 May 2001 17:46:51 -0600
Hello,
For a few years now I've been interested in working
with URNs (RFC 2141). I'm looking for pointers to available
Python software for dealing with them. I can't find any
hints on Google or the Vaults, and I'm pretty sure there
haven't been any newsgroup posts over the last couple of years ...
nope, not in groups.google.
I'm also interested in tools for content negotiation
(RFC 2295).
Here's a description of what I'm doing - I say it here
to make sure I'm looking for the right things.
I have a lot of bioinformatics records, which have
various identifiers:
    SWISS-PROT entry 100K_RAT
    SWISS-PROT accession Q62671
    PDB id 2PLV
    ...
These can be on the local machine or available on a web site.
If on the local machine, a record might be retrieved with
    popen("zcat sprot39.tar.gz | head -12345 | tail -50")
    open("/local/databases/swiss/acc/Q62671")
or through the web with
    http://ca.expasy.org/cgi-bin/get-sprot-raw.pl?Q62671
or perhaps through an intranet server.
I would like to centralize requests for a document by using
a URN, so I might ask for one of
    urn:bio:swiss-prot/accession/Q62671
    urn:bio:swiss-prot/entry/100K_RAT
    urn:bio:pdb/id/2PLV
and get an object that can be used no matter the data source.
This requires some sort of:
- standard object for dealing with different resources
  (files, URLs, in-memory data structures, MySQL, ...)
- registry system for resolving the URN (though in this
  case the NID will always be "bio", so the resolver works one
  level further down, in the NSS)
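
To make the registry idea concrete, here is a minimal sketch in Python.
It is only an illustration, not an existing library: the names
parse_bio_urn and Registry are made up, and it assumes the NSS always
has the three-part "database/key/value" shape used in the examples
above. A resolver here is any callable that takes the identifier and
returns whatever handle the caller wants (a file object, a URL, ...).

```python
import re

# RFC 2141 syntax is "urn:" <NID> ":" <NSS>; the "urn" prefix and the NID
# are case-insensitive.
_URN_RE = re.compile(r"^urn:([A-Za-z0-9][A-Za-z0-9-]{0,31}):(.+)$", re.IGNORECASE)


def parse_bio_urn(urn):
    """Split 'urn:bio:<database>/<key>/<value>' into (database, key, value).

    The three-part NSS layout is an assumption taken from the examples in
    this message, not something RFC 2141 defines.
    """
    match = _URN_RE.match(urn)
    if match is None:
        raise ValueError("not a URN: %r" % (urn,))
    nid, nss = match.group(1).lower(), match.group(2)
    if nid != "bio":
        raise ValueError("unsupported NID: %r" % (nid,))
    parts = nss.split("/")
    if len(parts) != 3:
        raise ValueError("expected database/key/value, got %r" % (nss,))
    return tuple(parts)


class Registry:
    """Hypothetical registry mapping (database, key) to a resolver callable."""

    def __init__(self):
        self._resolvers = {}

    def register(self, database, key, resolver):
        self._resolvers[(database, key)] = resolver

    def resolve(self, urn):
        database, key, value = parse_bio_urn(urn)
        try:
            resolver = self._resolvers[(database, key)]
        except KeyError:
            raise LookupError("no resolver for %s/%s" % (database, key))
        return resolver(value)


# Example wiring: these resolvers only build a location string, so the
# sketch can be run without touching the disk or the network.
registry = Registry()
registry.register(
    "swiss-prot", "accession",
    lambda acc: "http://ca.expasy.org/cgi-bin/get-sprot-raw.pl?" + acc)
registry.register(
    "swiss-prot", "entry",
    lambda name: "/local/databases/swiss/entry/" + name)  # hypothetical path
```

With this wiring, registry.resolve("urn:bio:swiss-prot/accession/Q62671")
hands back the ExPASy URL shown earlier. Swapping in a resolver that opens
a local file instead would not change the calling code, which is the point
of centralizing on the URN.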
I'm also interested in work done on content negotiation.
As described above, only a single document is returned.
In truth, people may want the document converted to a
different format, like:
    text/plain -- for the original record
    text/html -- for conversion to HTML
    bio/swiss-prot -- for (the null) conversion to SWISS-PROT format
    bio/fasta -- for conversion to FASTA (another format)
    bio/fasta; header=NCBI -- for conversion to FASTA using
        NCBI's header convention
    ...
What libraries are there for doing content negotiation on
the server side?
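
As a rough illustration of the kind of server-side selection meant here,
the sketch below picks a converter from an Accept-style preference string
using q-values. This is plain quality-value matching, not an
implementation of RFC 2295's transparent negotiation. The function names
and the converter table are hypothetical, and wildcards such as "*/*" are
not handled.

```python
def parse_accept(header):
    """Parse 'a/b;q=0.5, c/d;x=y' into [(media_type, params, q), ...].

    The result is sorted by descending q. A missing q defaults to 1.0.
    """
    choices = []
    for item in header.split(","):
        fields = [field.strip() for field in item.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        params = {}
        q = 1.0
        for field in fields[1:]:
            name, _, value = field.partition("=")
            name = name.strip().lower()
            value = value.strip()
            if not name:
                continue
            if name == "q":
                q = float(value)
            else:
                params[name] = value
        choices.append((media_type, params, q))
    # sorted() is stable, so equal q-values keep the order the client used.
    return sorted(choices, key=lambda choice: choice[2], reverse=True)


def negotiate(header, converters):
    """Return (media_type, params, converter) for the best match, or None."""
    for media_type, params, q in parse_accept(header):
        if q > 0 and media_type in converters:
            return media_type, params, converters[media_type]
    return None


# Hypothetical converter table: each value takes the raw record text and
# the media-type parameters, and returns the converted text.
def as_plain(record, params):
    return record


def as_fasta(record, params):
    # Placeholder only: a real converter would parse the record and use
    # params.get("header") to choose the header convention.
    return ">converted\n" + record


converters = {"text/plain": as_plain, "bio/fasta": as_fasta}
```

For example, given the request string
"text/html;q=0.9, bio/fasta;header=NCBI;q=0.8, text/plain;q=0.1" and the
table above, negotiate() skips text/html (no converter is registered for
it) and returns the bio/fasta converter together with {"header": "NCBI"}.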
Finally, what's the standard way to access different versions
of the same record? For example, the database release has a
version, and sometimes the record itself carries one. I can
treat the version as opaque - it's usually a number or a date.
I think I start from the versionless URN, which is
mapped to the most recent (or most appropriate) URL. Part of
the returned data is metadata giving the versioned URN.
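
Here is a small sketch of that lookup, under loudly stated assumptions:
the VersionIndex class is made up, the "/version/<v>" suffix used to form
a versioned URN is a hypothetical convention (nothing above fixes one),
and "most recent" simply means "added last", so the version tokens stay
opaque and are never compared.

```python
class VersionIndex:
    """Hypothetical map from a versionless URN to its known versions."""

    def __init__(self):
        # versionless URN -> list of (version, location), oldest first
        self._versions = {}

    def add(self, urn, version, location):
        self._versions.setdefault(urn, []).append((version, location))

    def resolve(self, urn, version=None):
        """Return the location plus metadata naming the versioned URN.

        With version=None the most recently added version is used.
        """
        entries = self._versions.get(urn)
        if not entries:
            raise LookupError("unknown URN: %r" % (urn,))
        if version is None:
            found, location = entries[-1]
        else:
            for found, location in entries:
                if found == version:
                    break
            else:
                raise LookupError("no version %r of %r" % (version, urn))
        return {
            "location": location,
            "version": found,
            # Assumed convention for illustration only.
            "versioned_urn": "%s/version/%s" % (urn, found),
        }


index = VersionIndex()
urn = "urn:bio:swiss-prot/accession/Q62671"
# The release numbers and the first path are invented for the example.
index.add(urn, "38", "/local/databases/swiss38/acc/Q62671")
index.add(urn, "39", "/local/databases/swiss/acc/Q62671")
latest = index.resolve(urn)
```

Asking for the versionless URN returns the release 39 location, and the
"versioned_urn" field in the result lets the caller pin that exact
version in later requests.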
Comments? Pointers? Ideas? Code?
Thanks!
Andrew Dalke
dalke@acm.org