[Fwd: RE: [BioPython] versioning, distribution and webdav (fwd)]

Eugene.Leitl@lrz.uni-muenchen.de
Tue, 15 May 2001 23:19:31 +0200


-------- Original Message --------
From: "Lisa Dusseault" <lisa@xythos.com>
Subject: RE: [BioPython] versioning, distribution and webdav (fwd)
To: "Eugene Leitl" <Eugene.Leitl@lrz.uni-muenchen.de>,<transhumantech@egroups.com>,
<transhumantech@planetx.com>,"forkit!" <fork@xent.com>

> A completely different idea, and which could be more practical,
> is to improve how updates are pushed about.  Presently when there
> is an update everyone needs to download the complete data file.
> Why isn't there a way to get deltas?  If possible, that means
> the transfer rate is the derivative of the growth rate, which
> is still exponential but it saves almost a year.

Take a look at the RSync work and the RProxy designs which are for using the
RSync algorithm over HTTP.  I've been exposed to a bunch of this kind of
thing over the years, and I like the rsync algorithm best for most synch
situations I've seen, because most synch situations involve limited space.
(A counter-example would be a source-code repository, where old versions of
all files are, or should be, kept around: in this situation another
algorithm may be preferable to reduce processing costs.)

Why I like rsync/rproxy:
 - It doesn't require all old versions, or diffs against old versions, to be
   stored on the server
 - It can work if any two of [client, proxy, server] support the concept

I did some work last year on extending rsync/RProxy to do upload as well as
download over HTTP. The specification is at the URL below, and it links to
the original work.
http://www.sharemation.com/~milele/public/rsync-specification.htm
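
As a rough illustration of why the rsync approach needs no stored history,
here is a minimal Python sketch of block-signature deltas. It is a
simplification and not rsync's actual implementation: the block size, the
function names, and the choice of adler32 plus SHA-256 are all assumptions
made for the example, and the weak checksum is recomputed at every offset
rather than rolled forward the way rsync does it.

```python
import hashlib
import zlib

BLOCK = 4  # tiny block size so the example is easy to follow (assumption)

def signature(old, block=BLOCK):
    """Receiver side: map weak checksum -> [(strong hash, block index)]."""
    sig = {}
    for i in range(0, len(old), block):
        chunk = old[i:i + block]
        entry = (hashlib.sha256(chunk).digest(), i // block)
        sig.setdefault(zlib.adler32(chunk), []).append(entry)
    return sig

def delta(new, sig, block=BLOCK):
    """Sender side: describe `new` as copies of old blocks plus literal bytes."""
    ops, literal, pos = [], bytearray(), 0
    while pos < len(new):
        chunk = new[pos:pos + block]
        match = None
        # Cheap weak check first, then confirm with the strong hash.
        for strong, idx in sig.get(zlib.adler32(chunk), []):
            if strong == hashlib.sha256(chunk).digest():
                match = idx
                break
        if match is None:
            literal.append(new[pos])  # no known block starts here
            pos += 1
            continue
        if literal:
            ops.append(("data", bytes(literal)))
            literal = bytearray()
        ops.append(("copy", match))
        pos += len(chunk)
    if literal:
        ops.append(("data", bytes(literal)))
    return ops

def patch(old, ops, block=BLOCK):
    """Receiver side: rebuild the new file from the old file and the delta."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg * block:(arg + 1) * block] if kind == "copy" else arg
    return bytes(out)

old = b"ACGTACGTTTGA"
new = b"ACGTGGACGTTTGA"
ops = delta(new, signature(old))
rebuilt = patch(old, ops)
```

Only the signature travels in one direction and the delta in the other; the
sender never needs a copy of the receiver's old file.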

> One such protocol already exists which looks promising: WebDAV.
> See http://www.webdav.org/ . If my limited understanding of it
> is correct, it should be possible to treat each record as its
> own document, and update only the records or parts of the record
> that changed.

Absolutely.  But, despite my own bias towards WebDAV (Xythos builds a WebDAV
server and web document storage platform called Xythos WebFile Server),
frankly HTTP may serve your purpose as well as WebDAV.  Either one would
need some extensions to be truly great at this application.

But even without extensions to HTTP (including WebDAV), as long as the
bio-data was formatted as separate web entities (separate URLs) users need
only download those with changed ETags.  (See the HTTP 1.1 spec,
http://www.ietf.org/rfc/rfc2616.txt, section 14.19).  That would presumably
help a great deal in this scenario, because at least if the files are broken
up intelligently, users would only have to download files that had changed.
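
The ETag check described above could look roughly like the following Python
sketch, using only the standard library. The function names and the example
URL are hypothetical, and it assumes the server sends an ETag header and
honors If-None-Match by answering 304 Not Modified.

```python
import urllib.error
import urllib.request

def build_conditional_request(url, etag=None):
    """Build a GET request that asks for the body only if the ETag changed."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    return req

def fetch_if_changed(url, etag=None):
    """Return (body, new_etag), or (None, etag) when the server answers 304."""
    try:
        with urllib.request.urlopen(build_conditional_request(url, etag)) as resp:
            return resp.read(), resp.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 Not Modified as an HTTPError
            return None, etag
        raise

# Hypothetical record URL; no network traffic happens until urlopen is called.
req = build_conditional_request("http://example.org/records/1", '"abc123"')
```

A client would keep the returned ETag next to each record and pass it back on
the next update run, so unchanged records cost one small round trip each.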

Also, there is work to extend HTTP to use diff algorithms to download
smaller pieces in cases where files are changed:
http://www.ietf.org/internet-drafts/draft-mogul-http-delta-08.txt.  It now
mentions the rsync algorithm but hasn't specified how to use it, probably
because it's somewhat different from regular diffs (the client has to send
up a special signature along with the request, rather than just requesting
the diff).

> Looking at the webdav.org page, it mentions that DAV servers can
> be mounted on MS Windows and Mac OS X.  I haven't looked into
> what that means, but it suggests a potential way for existing file
> oriented software to work with DAV-based systems.

And linux/*ix -- mod_dav is a free module for Apache.  Yes, some WebDAV
servers expose the existing file system as WebDAV collections and resources.
If there is limited need for extensive metadata, these solutions are super.

> One thing that comes to mind is to operate a honking proxy/caching
> server.  This is interesting because it is scalable.  If you are
> only interested in a small part of the database, it stays on the
> local network.  If you configure it for updating, it would update
> those files on demand.  Yet if you want all the database it could
> download all of the files for local access and only update the
> deltas when things change.  All data acts local.

And this would help the bio-data scenario a great deal!  It depends to a
large extent on how the database is currently structured -- if it currently
doesn't have a way to mark data as changed (a modified date, an ETag, diffs,
or some such mechanism) then it would be much harder to put in place a
proxy/caching server.
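
To show why a change marker matters for such a cache, here is a small Python
sketch of a revalidating record cache. The RecordCache class and the
fetch(url, etag) callback are hypothetical names for this example; the
callback is assumed to return (body, etag) when the record changed and
(None, etag) when it did not.

```python
class RecordCache:
    """Keep local copies of records and re-download only the changed ones."""

    def __init__(self, fetch):
        self._fetch = fetch  # fetch(url, etag) -> (body or None, etag)
        self._store = {}     # url -> (etag, body)

    def get(self, url):
        etag, body = self._store.get(url, (None, None))
        new_body, new_etag = self._fetch(url, etag)
        if new_body is not None:  # first download, or the record changed
            etag, body = new_etag, new_body
            self._store[url] = (etag, body)
        return body

# Stand-in for the origin server, so the sketch runs without a network.
origin = {"/rec/1": ('"v1"', b"ATGC")}
transfers = []

def fake_fetch(url, etag):
    current_etag, body = origin[url]
    if etag == current_etag:
        return None, etag  # equivalent of a 304 answer
    transfers.append(url)  # count full downloads
    return body, current_etag

cache = RecordCache(fake_fetch)
first = cache.get("/rec/1")
second = cache.get("/rec/1")  # served locally after revalidation
```

Without some marker like the ETag here, the cache has no way to tell whether
its copy is stale short of downloading the record again.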

> Of course, this is also related to the CORBA work, and I don't know
> enough about it to judge the overlap.  My ungrounded feeling is that
> it may be too heavyweight for this specific task.

I agree.  In fact, WebDAV may be too heavyweight for this specific task.

> Anyway, those are thoughts this really late night/early morning.
> (If this ends up sounding like the ravings of a crazed lunatic,
> I'll just blame it on a bad sleep schedule :)

No, you've touched on a number of very important issues that are only now
starting to be broadly dealt with. The WebI group at the IETF is also
working on related topics, as is the group of authors working on the
various 'mogul' drafts.  It's a bit scary because there's a great deal to be
gained by standardizing on one (or at least not too many) delta format(s),
yet it's hard to know what the right one is for all situations (given
tradeoffs between processing power, storage, and bandwidth availability).

The subject of this email mentions "versioning".  I don't know how important
it is to actually have old versions around -- do people really want to be
able to go back to old versions?  If not, then versioning is probably
unnecessary.  But if it is necessary, then the WebDAV group is currently
adding true interoperable versioning to WebDAV, and you're probably back to
requiring WebDAV to do the job right.

Lisa Dusseault