[Fwd: RE: [BioPython] versioning, distribution and webdav (fwd)]

Andrew Dalke dalke@acm.org
Tue, 15 May 2001 22:32:28 -0600


First, in case there were any questions, let me comment that
my message was an expression of concern for the future of
bioinformatics research, with an example of ways to reduce
my concern.  Any specific technology was meant as an example
of a class of technologies.

That said, Lisa Dusseault <lisa@xythos.com>:
>Take a look at the RSync work and the RProxy designs which are for using
>the RSync algorithm over HTTP.

Sigh.  And for some reason I feel obligated to do quick follow-ups
even when I know I don't have the time.  This meant today I had to actually
read the documents for WebDAV and related work :)

I took a look at
 http://www.sharemation.com/~milele/public/rsync-specification.htm
The "Requirements and Scenarios" section talks explicitly about WebDAV, so
at least I'm not too far from a reasonable approach.

> I've been exposed to a bunch of this kind of
>thing over the years, and I like the rsync algorithm best for most synch
>situations I've seen, because most synch situations involve limited space.
>(A counter-example would be a source-code repository, where old versions of
>all files are, or should be, kept around: in this situation another
>algorithm may be preferable to reduce processing costs.)

Subversion is the version control system I was thinking of that is
built on top of WebDAV (see http://subversion.tigris.org/ ).  It claims to
allow client-side plug-in diff programs, but then goes on to say in
 http://www.tigris.org/files/documents/15/48/svn-design.html
that it has its own text diff program.  rsync also uses a
fingerprint to identify portions that have changed.  But perhaps
that could be done with some unique identifier token stored as some
property?
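
To make that concrete, here is roughly the sort of per-block
fingerprinting I have in mind - a very simplified sketch in Python
(real rsync uses a rolling weak checksum plus a strong one, and the
block size and function name here are just placeholders):

    import hashlib

    BLOCK_SIZE = 4096  # arbitrary; real rsync tunes this

    def block_fingerprints(path):
        # One (offset, digest) pair per fixed-size block of the file.
        fingerprints = []
        with open(path, "rb") as f:
            offset = 0
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                fingerprints.append((offset, hashlib.md5(block).hexdigest()))
                offset += len(block)
        return fingerprints

A client and server that each compute such a table could exchange just
the digests and then ship only the blocks that differ.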

I'm more than willing to say that I don't know what's appropriate,
only that this is a direction we (as people who use bioinformatics
data) need to explore and likely use.  So I thank you very much for
your comments.

>Absolutely.  But, despite my own bias towards WebDAV (Xythos builds a
>WebDAV server and web document storage platform called Xythos WebFile Server),
>frankly HTTP may serve your purpose as well as WebDAV.  Either one would
>need some extensions to be truly great at this application.

Isn't WebDAV built on top of HTTP?  Perhaps I'm missing something here.

>users need
>only download those with changed ETags.  (See the HTTP 1.1 spec,
>http://www.ietf.org/rfc/rfc2616.txt, section 14.19).

Ah, I see what I'm missing - your statement (not quoted) about not
necessarily needing extensions on top of HTTP 1.1 because that spec
has a way to tell if an entity is up-to-date.
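
Something like the following is what I picture for the ETag check -
just a sketch in Python, and the host, path, and cached value are
invented:

    import http.client

    cached_etag = '"abc123"'  # saved when the file was last downloaded
    conn = http.client.HTTPConnection("data.example.org")
    conn.request("GET", "/swissprot/P12345.dat",
                 headers={"If-None-Match": cached_etag})
    response = conn.getresponse()
    if response.status == 304:
        print("local copy is still current")
    else:
        data = response.read()
        cached_etag = response.getheader("ETag")

One request per file, and nothing but headers comes back when the
record hasn't changed.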

One of the things I do want is diff support, since oftentimes the
changes are minor - maybe a couple of lines in a file might change.
(I don't have good statistics on this, only a hunch.)
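
For instance, given two releases of a record on disk, Python's difflib
produces the kind of small patch I would like to be able to fetch
directly (the file names are made up):

    import difflib

    old_lines = open("P12345_release1.dat").readlines()
    new_lines = open("P12345_release2.dat").readlines()
    patch = difflib.unified_diff(old_lines, new_lines,
                                 fromfile="release1", tofile="release2")
    print("".join(patch))

If only a couple of lines changed, that patch is a few hundred bytes
instead of the whole record.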

>at least if the files are broken
>up intelligently, users would only have to download files that had changed.

It would be nice to get the full list of changes with one exchange
rather than piecemeal, which appears to be a feature of Subversion
but would have to be built on top of stock HTTP 1.1 (as I understand
it - again, educated guesswork).
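
The obvious hack on top of stock HTTP would be a single manifest
resource - one checksum per file - that the client fetches once and
compares against its previous copy.  A throwaway sketch in Python (the
tab-separated manifest format is my own invention):

    def changed_paths(old_manifest, new_manifest):
        # Each manifest is text with one "path<TAB>checksum" line per file.
        old = dict(line.split("\t") for line in old_manifest.splitlines() if line)
        new = dict(line.split("\t") for line in new_manifest.splitlines() if line)
        return [path for path, checksum in new.items()
                if old.get(path) != checksum]

One GET for the manifest, then one GET per changed file.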

There are times when it would be nice to use different versions of
the same record, e.g., from two different releases of the same database.
That couldn't be done with this proposal.

'Course, this means I really should write a requirements draft ..
did I mention something about not having time?

>Also, there is work to extend HTTP to use diff algorithms to download
>smaller pieces in cases where files are changed:
>http://www.ietf.org/internet-drafts/draft-mogul-http-delta-08.txt.

>And linux/*ix -- mod_dav is free module for Apache.  Yes, some WebDAV
>servers expose the existing file system as WebDAV collections and resources.
>If there is limited need for extensive metadata, these solutions are super.

Most programs don't know how to use URLs, only files, so this is a
nice feature.

>It depends to a
>large extent on how the database is currently structured -- if it currently
>doesn't have a way to mark data changing (a modified-date, an etag, diffs,
>or some such mechanism) then it would be much harder to put in place a
>proxy/caching server.

That would be for the database folks to answer.  My impression is that
it's stored in some DBMS and dumped to files for export at each
release.  If that data were "live", meaning tied directly to the server,
then it would also be possible to map record elements to WebDAV
properties.  Hmm, I need more time for thought on this - and practice
being a database vendor instead of a recipient.
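
For instance, a PROPFIND against a single record could hand back a
couple of fields as properties without transferring the whole entry.
The host, path, and "bio-props" namespace below are pure invention on
my part:

    import http.client

    body = """<?xml version="1.0"?>
    <D:propfind xmlns:D="DAV:" xmlns:B="http://example.org/bio-props/">
      <D:prop>
        <D:getetag/>
        <B:sequence-version/>
      </D:prop>
    </D:propfind>"""

    conn = http.client.HTTPConnection("data.example.org")
    conn.request("PROPFIND", "/swissprot/P12345", body=body,
                 headers={"Depth": "0", "Content-Type": "text/xml"})
    print(conn.getresponse().read().decode())

PROPFIND and the getetag property are plain RFC 2518 WebDAV; what's
speculative is whether the database side could keep custom properties
in sync with the records.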

>I agree.  In fact, WebDAV may be too heavyweight for this specific task.

Kinda depends on what's needed :)

>yet it's hard to know what the right one is for all situations (given
>tradeoffs between processing power, storage, and bandwidth availability).

Oh, that's easy.  There isn't a right one.  It's just like compression
algorithms as used for transport.  Let it be extensible, define one or
two as the de facto MUST-be-accepted defaults (making sure the source is
public domain or as close as possible), then in a couple of years, when
people tell you that you were wrong, add the accepted one and deprecate the
old ones, to be removed in about 20 years.
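
In other words, negotiation much like HTTP's Accept-Encoding header,
only for delta formats.  A toy sketch of the server side in Python
(the header name and format names are placeholders, not a proposal):

    SUPPORTED = ["vcdiff", "xdelta", "plain"]  # what this server can produce

    def pick_delta_format(accept_delta_header):
        # Client sends, say, "Accept-Delta: xdelta, plain"; use the first
        # format it lists that we also support, falling back to the one
        # format everybody is required to accept.
        requested = [name.strip() for name in accept_delta_header.split(",")]
        for name in requested:
            if name in SUPPORTED:
                return name
        return "plain"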

>The subject of this email mentions "versioning".  I don't know how important
>it is to actually have old versions around -- do people really want to be
>able to go back to old versions?

I think it is for some types of research, but the demand for this
is at least two orders of magnitude less than just being able to get
the most recent version.  But there is a need, so I don't want to preclude it.

>if it is necessary, [...] you're probably back to
>requiring WebDAV to do the job right.

Okay.

Thanks again for all your comments!

Anyone want to pay me to work on this?  Or on biopython in general :)

                    Andrew
                    dalke@acm.org