[BioPython] large data distribution (was: versioning, distribution and webdav)

Andrew Dalke dalke@acm.org
Wed, 16 May 2001 23:03:56 -0600


I think I made a mistake in getting too quickly into detail
when I really wanted to keep the discussion more general
and abstract.  Basically, what do you think is needed in
the future for access to bioinformatics data?  What are your
concerns and worries?  What technologies do you think will
be needed and why?

So, take two! :)

Bioinformatics data has been growing exponentially and will
be for the foreseeable future.  It is doubling roughly every
year, a rate comparable to the rate at which storage and
processing costs fall.  The limiting factor is data distribution.

Bandwidth is not growing that quickly.  A T1 (1.5Mbits/sec)
costs roughly $1,000/month and a T3 (45Mbits/sec) costs
about $20,000/month.  The world's publicly available
bioinformatics databases currently contain about 100GBytes
of data, which is a week at T1 or 1/4 day at T3.
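To make that concrete, here is the arithmetic as a few lines of
Python (the 100 GBytes and the two line rates are the only inputs):

  # Back-of-the-envelope transfer times for ~100 GBytes of public data.
  total_bits = 100e9 * 8                            # 100 GBytes in bits

  for name, rate in [("T1", 1.5e6), ("T3", 45e6)]:  # line rates in bits/sec
      seconds = total_bits / rate
      print("%s: %.1f days (%.1f hours)" % (name, seconds / 86400, seconds / 3600))

  # T1: 6.2 days (148.1 hours)  -- about a week
  # T3: 0.2 days (4.9 hours)    -- roughly a quarter of a day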

This implies that in 5 years researchers with only T1
connectivity will not be able to keep up with the full data
stream.  The saturation for T3 is about 10 years, which is too
far in the future to really predict anything.  (That being
two years after the Coffee Riots of '09 :)

I believe this bandwidth limitation is a problem because it
prevents certain people from doing research.  That includes
people like me at a small company, but it is true for anyone
who can't afford an extra $240,000 per year for more
connectivity.

  Conjecture 1:
     As database sizes grow, the number of people working on whole
     database problems (or dense random-access problems) decreases.

I just made up the term "dense random-access."  By "dense" I mean
"want a lot of records" and by "random-access" I mean "not already
organized by one of the database providers."  For example, getting
all of the non-human primate sequences from GenBank is dense
(because there are a lot of records) but it isn't random-access
because GenBank makes it easy to access that subset by downloading
a few files.

Retrieving the records after doing a BLAST search is not dense
because only a few records (<1,000) are needed but it is random
access because those records aren't a priori organized together.

This conjecture is testable, and Ewan could help out with this.
Analysis of the download logs from EBI should provide a rough
measure of diversity.  The logs most likely include hostname data for
people who have downloaded EMBL data over the last few years.
This can be processed to get a rough count of the number of
different organizations that want the full database and see how
that has changed over time.

My conjecture says that diversity will decrease, if it hasn't
already.

Here's another test:  Is there anyone here who would like to
download and keep up to date all of GenBank/EMBL/SWISS-PROT/PDB
but does not do so because of the time it takes?  I am one.
I have the disk space but not the bandwidth.

Qualifiers to Conjecture 1:

 - I think the numbers are hard to judge without comparison to the
growth in the number of bioinformatics researchers.  After all,
if there are 100 times as many sites doing bioinformatics, then a
doubling of the number of full-database downloads really means a
50-fold drop in the fraction of sites downloading, not a doubling.
Perhaps looking at the diversity
of people downloading individual records through the web interface
can be used as a rough measure for the total number of people
interested in EMBL data.

Ewan?  You wouldn't happen to have that data handy, would you?

 - People could receive the data through means other than the
Internet.  GenBank used to ship their data on media, but I recall
they stopped doing that a few years ago.  The PDB still ships
their release on about 6 CDs.  What about the other databases?
And does anyone here get their updates on media?

A few years ago, when my then-employer was looking into being a data
provider, there were a few other companies doing or planning to do
the same.  For example, Molecular Informatics (since bought
by PE, now AB) sold an Oracle database dump of bioinformatics
data, and provided updates for it.

  If companies like this are common then they would skew the
data seen by Ewan.  I get the feeling they aren't very common so
it shouldn't be a problem.


 - There are research problems needing "dense random-access."  My
whole concern is predicated on this, but I want to make the dependency
explicit.  If people only want well defined subsets of the data
("all the human genes" or "everything matching Prosite pattern X")
then the database providers can provide those subsets directly.
This assumes those subsets aren't of comparable size to the
databases as a whole.

I believe there are interesting problems in this area.  They include
projects like free-text search of the database for data mining, or
developing new tests for distant homology.  (As an interesting
note, the first of these projects only needs the non-sequence data
in the records while the second only needs the sequence data and
a unique identifier, which suggests a need for access to
partial records.)

On a related need, some of the companies we talked to expressed
an interest in having the database search services (like BLAST)
in-house for privacy reasons - they didn't want their proprietary
sequences sent outside the company and to a lesser extent didn't
like the possibility for traffic analysis based on the requests.
This implies a need for all of the data to be available in-house.


  Conjecture 2:
     There are technical means that reduce the impact of the slow
     growth in bandwidth and do not require a change in physical
     infrastructure.

By this I mean there are software-only solutions. I outlined
a few of these:

 - Use a better compressor.  Why don't the providers allow
downloads via bzip2?  It compresses better than gzip in all
the tests I've done, at the cost of CPU.  But CPU is not the
limiting factor.  This is eminently testable (a sketch follows below).

I've also mentioned the possibility of even better compression
using methods that understand the database format instead of
general-purpose text compression.
This would be a research problem - it can be done and will get
better compression, but I don't know how long it would take to
develop or how good it would be.
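For anyone who wants to run the gzip-versus-bzip2 test themselves,
here is a minimal sketch using Python's zlib and bz2 modules;
"sample.dat" is a placeholder for a few megabytes of whatever flat
file you have handy:

  # Compare gzip-style (zlib) and bzip2 compression on a chunk of flat-file data.
  import zlib, bz2

  data = open("sample.dat", "rb").read()     # e.g. a slice of an EMBL flat file

  gz = zlib.compress(data, 9)
  bz = bz2.compress(data, 9)

  print("original : %d bytes" % len(data))
  print("zlib  -9 : %d bytes (%.1f%%)" % (len(gz), 100.0 * len(gz) / len(data)))
  print("bzip2 -9 : %d bytes (%.1f%%)" % (len(bz), 100.0 * len(bz) / len(data)))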

 - Is ftp slower than http?  I've heard both yea and nay on this.

 - Enable updating using diff for each record rather than sending
the full file on updates.  This assumes changes to a record are
small.  The rsync protocol does this, as do others.
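As a sketch of the per-record idea, Python's difflib can produce the
patch (the file names here are placeholders; the client side could
apply them with the ordinary patch tool or with rsync itself):

  # Generate a unified diff for one record instead of reshipping the whole file.
  import difflib

  old_record = open("old/P12345.txt").readlines()   # record from the previous release
  new_record = open("new/P12345.txt").readlines()   # the updated record

  patch = difflib.unified_diff(old_record, new_record,
                               fromfile="P12345 (old)", tofile="P12345 (new)")
  open("updates/P12345.diff", "w").writelines(patch)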

 - Let people work with subsets of the data locally while leaving
the rest accessible over the network, perhaps with a proxy using a
large cache.

Despite what I said, I know there are a lot of people who are only
interested in a subset of the database.  Currently they need to
download a lot of large data files to get all of the GPCRs.  If
those records are identified through other means, there should be
a way to store and update just that subset.

As described just now, the subset is 'dead.'  There could be a way
to have a running query on the database provider tell the proxy
system that new records matching the query have been added or
modified.  This allows (dare I say) push technology!
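In client-side terms it might look like the sketch below.  The
/changed URL and its parameters are pure invention on my part - no
provider offers anything like it today - but it shows how little the
client needs:

  # Poll a (hypothetical) running query: ask which records matching a stored
  # query have changed since the last check, then refresh only those.
  import time
  from urllib.parse import quote
  from urllib.request import urlopen

  last_check = "2001-05-16T00:00:00"
  query = "GPCR AND human"

  while True:
      url = ("http://data.example.org/changed?query=%s&since=%s"
             % (quote(query), quote(last_check)))
      changed_ids = urlopen(url).read().decode().split()
      for record_id in changed_ids:
          print("fetch and update local copy of", record_id)
      last_check = time.strftime("%Y-%m-%dT%H:%M:%S")
      time.sleep(600)       # poll every 10 minutes; true push would invert this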

 - Ensure that everything is transparent and scalable.  Among
other things this allows the code to be moved to places with better
data locality for better performance, if needed.

 - Enable ways to only pull the parts of the records that are
directly relevant.  Eg, if I only want the sequence data for local
BLAST searching I can defer looking up the publication information
until I've identified the important records.
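On the client side this could be as simple as a record object that
defers the expensive part; fetch_sequence() and fetch_annotation()
below are stand-ins for whatever partial-record protocol the provider
ends up offering:

  # A record that carries only the sequence up front and fetches the annotation
  # (publications, features, ...) the first time someone actually asks for it.
  class LazyRecord:
      def __init__(self, record_id):
          self.id = record_id
          self.sequence = fetch_sequence(record_id)   # small, always needed
          self._annotation = None

      @property
      def annotation(self):
          if self._annotation is None:                # fetched only on first access
              self._annotation = fetch_annotation(self.id)
          return self._annotation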

 - Provide backwards-compatible solutions.  One interesting possibility
is the class of systems that mount WebDAV shares as if they were local
files.  Metaphorics has their "Virtual NFS" system which does a similar trick.

In general, WebDAV and systems built on top of it (like Subversion
or the software developed by Lisa's company, Xythos) seem to be the
most appropriate long-term solution.  Given the work involved, I
don't think there is a viable short-term solution (as with raw HTTP/1.1
and its ETags).

  Conjecture 3:
     This is a major architectural change to the way things work and
     will require at least five years.

 - The data providers (NCBI, EMBL, ExPASy, PDB, etc.) will need to
provide new interfaces beyond the simple "file on an ftp server"
model used now.  Making this sort of change takes time.

In a somewhat pessimistic sense, they also need to be convinced
that this is needed.  Their in-house researchers don't have problems
accessing the data.  It is useful (in a funding sense) to have
external researchers visit the site because it's a tangible way to
show that they provide useful access, esp. as compared to the
intangible "we distributed 5PBytes of data".

Many of these sites have responded to the "too much data" problem
by in essence becoming portals, that is, by having web interfaces
for the standard types of data discovery needed by most researchers.
(This precludes most development of new methods, which is my concern.)
Again from a pessimistic viewpoint, it is a funding advantage for
these sites to say that 2,000 people use their services every day.

(None of this viewpoint is based on real data and no offense is
meant to any of the data providers.  I just know from experience
how hard it was to justify free software development to our grant
providers when they couldn't see the people using the software, and
I believe providing data freely has the same issues.)

Have people complained to these sites that getting data from them
is becoming too complex?  Or that the web interfaces are incomplete
for the types of research they need to do?

Is there a possibility for a company to provide more accessible
data as an intermediary if the main sites do not provide it?  I
don't think so, because of the bandwidth costs, the 24/7 uptime
requirement, the lack of a good revenue model (this isn't the '90s! :)
and the risk of being pushed out of the market if the source
data providers decide to implement the same service.

 - Client libraries need to be developed.  These exist for WebDAV
but are still early products.  We would need experience using them.
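For a taste of what the wire protocol looks like, here is a minimal
sketch that issues a raw PROPFIND using nothing but Python's standard
HTTP client; the host and path are placeholders:

  # List the contents of a collection on a WebDAV server with a bare PROPFIND.
  import http.client

  conn = http.client.HTTPConnection("dav.example.org")
  body = ('<?xml version="1.0" encoding="utf-8"?>'
          '<D:propfind xmlns:D="DAV:"><D:allprop/></D:propfind>')
  conn.request("PROPFIND", "/embl/", body,
               headers={"Depth": "1", "Content-Type": "text/xml"})
  response = conn.getresponse()
  print(response.status, response.reason)   # 207 Multi-Status on success
  print(response.read().decode()[:500])     # XML describing each resource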

 - Servers need to be developed.  These exist in Apache's mod_dav
extension, but again we need experience.  To be truly useful we
would need to develop methods which allow any site to become a
database host given just the flat-file data and a bit of work.
(The less work the better.)

 - Tools need to be rewritten to allow partial updates.  Most
systems now require a reindex of the full database when an update
arrives.  This is okay if the database is updated every week, but
doesn't work if there are updates every 10 minutes.

 - Old code needs solutions that are backwards compatible.  These
need to be developed and installed on researchers' machines.  (The
last step won't happen until there is a real need for it.)


  Conjecture 4:
    Bioinformatics data will continue to grow exponentially.

I think this is true for the foreseeable future, although its rate
may decrease slightly now that the human genome is .. whatever
the word is for "released but not finished".  In beta?  :)

My estimate for best-case saturation of a T1 line is 5 years from
now.  That's how long it takes for the flow of new
bioinformatics data to reach 1.5 Mbits per second.  By that time
we *must* have switched to a more sophisticated way to get partial
updates, as otherwise almost no one will be able to keep up with
the data, except those lucky enough to use subsets defined by the
providers.
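The back-of-the-envelope version, taking the 100 GBytes above as this
year's new data and doubling it annually:

  # Years until one year's worth of new data exceeds a T1's yearly capacity.
  t1_bits_per_year = 1.5e6 * 86400 * 365   # about 4.7e13 bits
  new_data_bits = 100e9 * 8                # this year's new data, in bits

  years = 0
  while new_data_bits < t1_bits_per_year:
      new_data_bits *= 2
      years += 1
  print(years)   # 6 with these round numbers; a line that isn't dedicated
                 # to database downloads saturates sooner, hence the 5 year figure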

My estimate for doing a major architecture change is also 5 years.
We need to get started now.


  Conjecture 5:
     There are technical means to increase bandwidth which require
     changes in physical infrastructure.

My calculations earlier assumed T1 costs would stay constant.  They
will decrease and it may become common to have two bonded T1 lines.
This doubles the bandwidth and defers the congestion problem by a year.

I don't think most places will have cheap T3 service any time soon.

"Never underestimate the bandwidth of VW microbus/SUV filled with
tapes/DVDs."  (Choose the alternative based on your age :)
FedEx the data.  Storage costs for removal media are decreasing at
roughly the rate that bioinformatics data sizes are increasing. 
The data delta now is a DVD every couple months, and a tape would
be more scalable over time.

There are problems with this.  Socially I know people don't update
data when shipped to them.  I've seen the CDs just sitting around
because no one wanted to take the time to update them.  There were
many reasons including
  - takes time
  - things done once a month need to be relearned (including things
      like finding and remounting the removable drive)
  - "not my job"
  - updates break things, or at least change results
  - need to find more disk space
Technically it's a problem because keeping up with advances in
storage technology requires buying new drives.  OTOH, they can also
be used for backups.

Any new software architecture should allow data to be integrated
from local resources in addition to/instead of from the net.
Otherwise it precludes updating from tape.


  Conjecture 6:
     Data providers will increasingly become service providers.

This is an easy conjecture to prove since in its weakest form
it means the data providers will add more web search capabilities.
Let me strengthen it somewhat.

When all the data and software are easy to download and install,
people use them locally.  The data is getting harder to
download, so there is less incentive to make it easy to install
the programs that work with the data.  (I once wrote a system
that installs BLAST, FASTA, DSC, Prosite searching and keyword
searches, and does automatic updates, all in a single configure
and install script, so that's an existence proof that it's doable.)

Since the data providers have easy access to the data, and to the
people who can figure out how to compile the bits of software, it's
pretty natural for them to write servers, and there's no reason
for them not to continue to do so.

What will happen is an increase in the diversity of the servers.
CORBA and WebDAV are two types of new service.  I mentioned
another: an automated query system that notifies clients that
new or updated data is available for transfer.  (PubMed does
something like this for literature services, and there are
companies like Entigen which do this for BLAST searches.)

Still, all these are prepackaged services, so here's where I
go out on a limb:

  Conjecture 6.5:
     This includes colocation - data providers will start
     allowing any (sandboxed) code to run on local machines.

This could also be named the "If Mohammad cannot go to the
mountain..." conjecture :)

This is just like the good old days when everyone had accounts
on the various machines which had the right resources (CPU,
memory, graphics, printer, ...), except that back then everyone was
invited into the main house, while now visitors will be restricted
to the guest house behind the gate.

This is possible because 1) hardware is cheap, and 2) search
results are small.

Hardware is cheap.  The data providers could make machines
accessible to the rest of the world to log in (via SSH) and use.
(And I do mean the rest of the world: it's easy to get a
sourceforge account, which includes an SSH login.)

Search results are small.  A BLAST search scans the whole
database and people want just the top (say) 100 matches.

So let people write their database search engines and put them
on a machine at the database provider's site.  Let them run
these as web or CORBA servers so they can be integrated with the
rest of the systems back at their own site.
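The web-server half of that is almost no code.  In the sketch below
run_search() is hypothetical - it stands in for BLAST, FASTA or
whatever program the visitor dropped onto the box:

  # Wrap a local search program as a tiny web service on the colo machine.
  from http.server import BaseHTTPRequestHandler, HTTPServer
  from urllib.parse import urlparse, parse_qs

  class SearchHandler(BaseHTTPRequestHandler):
      def do_GET(self):
          query = parse_qs(urlparse(self.path).query).get("seq", [""])[0]
          hits = run_search(query, max_hits=100)   # small result, big database
          self.send_response(200)
          self.send_header("Content-Type", "text/plain")
          self.end_headers()
          self.wfile.write("\n".join(hits).encode())

  HTTPServer(("", 8000), SearchHandler).serve_forever()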

May need to put some quotas on that machine.  Might want to
modify libc to disallow a few functions.  Do *not* mention
to me a requirement that only Java be allowed.  I mean it.

In order for this to work nicely it is important to have
a common set of APIs that can be used across the different
sites.  This allows people to develop things locally, then
move the code to the colo site and expect things to still work.
(E.g., the data is accessible using the same protocols, the
only difference being better performance.)

This is similar to the reason POSIX code can be easily ported
between the different Unix and Unix-like implementations.

  Conjecture 7:
     People will discover new forms of collaboration with
     whatever technology is introduced.

This is again a gimme conjecture - people always find new ways
to collaborate.   I'll just point out two possible examples:

  - WebDAV lets people from different locations update a server.
Suppose you are curating a database.  Presently you need access
to Oracle or whatever DBMS stores the primary data.  That pretty
much requires local people to be curators and discourages
distributed curation.  The efforts that are distributed work through
clumsy HTML-based solutions or specialized client programs.
  WebDAV provides a framework for publishing to the server (there is
a small sketch below) and so makes it much easier to develop tools
for distributed curation.

  Another way to think about this is that CVS and RCS both do
version control and work almost equally well for local machines
(where everyone can share the RCS directory over NFS).  But
CVS enables distributed development that just is not possible
with RCS.
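The sketch I promised above: the "publish" half of remote curation is
a single PUT.  The host, path and credentials are placeholders:

  # Push an updated record to the curated database over WebDAV.
  import base64, http.client

  record = open("P12345.txt", "rb").read()
  auth = "Basic " + base64.b64encode(b"curator:secret").decode()

  conn = http.client.HTTPConnection("curate.example.org")
  conn.request("PUT", "/records/P12345.txt", record,
               headers={"Authorization": auth, "Content-Type": "text/plain"})
  print(conn.getresponse().status)   # 201 Created or 204 No Content on success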

  - Suppose I develop a cool new search program.  I can drop it
on one of these colo machines and not have to worry about the
data access part (which I judge to be 1/2 of the development
cost).  This makes it easier to develop.  Because the site
also has ways to turn programs into servers, and because the
machine is at a readily accessible site, I can set things up for
others to try it out and get feedback.

  Conjecture 8:
     I need to get back to work on my client's code so I can get paid.

I think this is self-evident :) 

                    Andrew
                    dalke@acm.org

P.S.
  I haven't even considered the problems the data providers
will have when 200 different sites want a full update using their
T1 lines.  What's the outgoing bandwidth at EBI?