[BioPython] large data distribution (was: versioning, distribution and webdav)

Andrew Dalke dalke@acm.org
Thu, 17 May 2001 04:07:56 -0600


BTW, let me also add the good points about it, which I
neglected to do in my first response.

- It is meant for large systems well beyond any problems
we face now or in the next ten years.  (When I mentioned to a friend
of mine, a nuclear physicist at Michigan State, the data sizes
we deal with, he pointed out his data sets were 3 orders of
magnitude larger.)

- It combines work from different fields, which allows for
a cross-pollination not possible within a single field.

- It has more resources behind it (9.8 billion euro) than anything
I could imagine.  Perhaps I need more expensive tastes!

- It has a conceptual framework for how such systems might
be designed, and people with experience in doing so.


Then to respond to Rune Linding's follow-up to my comments:
>>  - I really didn't like the word 'grid' when I heard it a few
>>      years back as GRiD.  But then, I didn't like "www" so I'm
>>      not the best judge of this.
>
>i think this is irrelevant

Yep.  Completely agree.  Names by themselves don't mean anything.

But the word "grid" when used in this context has specific meanings
related to its use in national and international projects
and its connotations with the "power grid", including a history going
back to Multics, which was the first project I heard of where
people talked about the metaphor of plugging into a distributed
network to share resources.  That project and metaphor are well
known in the textbooks.

If a project accepts that word then I feel it must address those
relationships.  For example, in some sense Multics failed because
it never produced a really commercially viable machine.  On the
other hand, it succeeded wildly as a research topic, and Unix
is one of its descendants, if only as a reaction against Multics.

So if the DataGrid is specifically alluding to that history,
then I would urge caution for anyone who plans to depend on its
results within the next five years, which is what I see as the
required timeline.  On the other hand, promoting it as a research
platform is a different topic entirely.  I want that first.

>>  - their current site is poorly designed

>iam sorry but i dont see that using flash or placing a 4meg pdf document
>is a problem.

I made that statement for several reasons.  The first was purely
an observation, made knowing nothing else about the site at that point.
By making it I caution others that the site may be hard to visit.
In addition I hope to encourage others that good web design is
important, if only to avoid getting comments like mine.

In a larger sense, it suggests several views I consider negative.
  - usability is of relatively low concern

  - they have no problems using proprietary technologies (flash)
     Plus, they are using it for a navigation bar!  Without it the
     nav bar is broken.

  - they expect people to be connected to the web through large pipes
      (a 4MB PDF with no warning about its size)

Of these, I consider the first the most problematic.  If this is the
main entry point for the project then it suggests that other parts
of the project - like the code - could be just as unusable.

Yes, this is all inference.  But these problems are all so easy to fix
that leaving the site in such a poorly usable state raises its own
set of organizational implications.

>>  - It falls in the category of Grand Projects, and I've found that
>>     those require a lot of time, research, politics and money.

>well what you are talking about solving involves highspeed networking,
>offshore multiplexing, new routing methods etc and YES its expensive, takes
>time etc.

Really?  My goal is to allow people like me to continue working
with bioinformatics data.  I set a pretty firm limit that a
solution must use technology likely available in 5 years and be
accessible to small research and development groups who can spend
at *most* about $20,000 per year for access to bioinformatics data.
The highest speed I considered possible was two bonded T1 lines, for
3MBit/sec total.

If I can throw money at the problem then getting a T3 line, at
$240,000/yr, meets all of my data transfer needs for the next
decade, all with existing technology.  (I do not attempt to make
predictions more than 10 years into the future.)

On the other hand, if you are correct and it requires high-speed
networking, etc. then there's no way I can stay in this field as
an independent company because I just can't afford it.

I happen to think that the existing infrastructure can be
made much more efficient, both to allow a much longer use
of existing networks and to enable new sorts of
collaboration and development models.  This I can afford to do.

I could be wrong, which is why I proposed everything as a
set of conjectures, with some ability to test or measure the
validity of most of them.

>but so was the creation of the www backbone as its standing today

The commercial internet - the reason I am able to be an independent
company and live where I do - took 20 years to develop.  I predict
a solution is needed sooner than that, so there needs to be a way to
extend viability until next-generation networking (dense fiber
multiplexing, all-optical switches, fiber to my house, and all
those cool things) is common.

>>  - I don't think anything they are proposing would work over a
>>      T1 line,

>ofcourse not, its a new backbone structure and i think some of the
>technology will come from projects like myrinet who is deeply
>involved in cluster networks

I don't have the resources to invest in hardware, only thought.
Hence for me that is not a project in which I can participate.

>you have to differ between accessing a backbone(which move the big
>data around) and accessing the data....the last is a matter of
>structuring data and creating good interfaces...the first is a
>matter of fiber and cisco's :)

This was part of one of my conjectures -

  Conjecture 6.5:
     This includes colocation - data providers will start
     allowing any (sandboxed) code to run on local machines.

specifically

   Search results are small.  A BLAST search scans the whole
   database and people want just the top (say) 100 matches.

   So let people write their database search engines and put them
   on a machine at the database provider's site.  Let them run
   it as a web or CORBA server so it can be integrated with the
   rest of the systems back at their site.
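
A minimal sketch of what that could look like, using only the Python
standard library; the run_blast helper, the port, and the parameter
names are hypothetical stand-ins here, not anyone's actual interface:

    # Sketch of conjecture 6.5: the visitor's search code runs at the
    # data provider's site, and only the top N matches travel back
    # over the wire, never the whole database.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs

    def run_blast(query_seq, max_hits):
        # Hypothetical stand-in: scan the provider's local copy of
        # the database, return the best matches as text, one per line.
        return "hit_1\nhit_2\n"

    class SearchHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            params = parse_qs(urlparse(self.path).query)
            query = params.get("seq", [""])[0]
            n = int(params.get("n", ["100"])[0])
            body = run_blast(query, n).encode("ascii")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        # the client back home fetches e.g. /?seq=MKV...&n=100 and
        # integrates the small result with the systems at their site
        HTTPServer(("", 8000), SearchHandler).serve_forever()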

Sincerely,

                    Andrew
                    dalke@acm.org