[BioPython] large data distribution (was: versioning, distribution and webdav)
Rune Linding - EMBL
linding@EMBL-Heidelberg.DE
Thu, 17 May 2001 12:53:24 +0200
On Thu, May 17, 2001 at 04:07:56AM -0600, Andrew Dalke wrote:
> BTW, let me add as well the good points about it, which I
> was negligent in doing in my first response.
>
> - It is meant for large systems well beyond any problems
> we face now or will face in the next ten years. (When I mentioned to a friend
> of mine, a nuclear physicist at Michigan State, the data sizes
> we deal with, he pointed out his data sets were 3 orders of
> magnitude larger.)
>
> - It combines work from different fields, which allows for
> a cross-pollination not possible from a single field.
>
> - It has more resources behind it (9.8 billion euro) than anything
> I could imagine. Perhaps I need more expensive tastes!
>
> - It has a conceptual framework for how such systems might
> be designed, and people with experience in doing so.
>
>
> Then to respond to Rune Linding's follow-up to my comments:
> >> - I really didn't like the word 'grid' when I heard it a few
> >> years back as GRiD. But then, I didn't like "www" so I'm
> >> not the best judge of this.
> >
> >I think this is irrelevant
>
> Yep. Completely agree. Names by themselves don't mean anything.
>
> But the word "grid", when used in this context, has specific meanings
> related to its use in related national and international projects,
> and to its connotations of the "power grid", including a history going
> back to Multics, which was the first project I heard of where
> people talked about the metaphor of plugging into a distributed
> network to share resources. That project and its metaphor are well
> known in the textbooks.
>
> If a project accepts that word then I feel it must address those
> relationships. For example, in some sense Multics failed because
> it never produced a really commercially viable machine. On the
> other hand, it succeeded wildly as a research topic, and Unix
> is one of its descendants, if only as a reaction against Multics.
>
> So if the DataGrid is specifically alluding to that history,
> then I would urge caution for anyone who plans to depend on its
> results within the next five years, which is what I see as the
> required timeline. On the other hand, promoting it as a research
> platform is a different topic entirely. I want that first.
>
I still don't find it relevant :)
> >> - their current site is poorly designed
>
> >I'm sorry, but I don't see how using Flash or placing a 4 MB PDF document
> >is a problem.
>
> I made that statement for several reasons. The first was purely
> an observation, based on knowing nothing else about the site at this point.
> By making it I caution others that the site may be hard to visit.
> In addition, I hope to encourage others to treat good web design as
> important, if only to avoid comments like mine.
>
> In a larger sense, it suggests several views I consider negative.
> - usability is of relatively low concern
I don't have a problem extracting the info I want from their site.
>
> - they have no problem using proprietary technologies (Flash)
> Plus, they are using it for a navigation bar! Without it the
> nav bar is broken.
The hardware in front of you is proprietary technology..... but I'm not a Flash supporter :)
And I didn't give you this link to discuss open-source issues... really.
>
> - they expect people to be connected to the web through large pipes
> (a 4MB PDF with no caution as to its size)
I totally disagree, 4 megs is nothing :)
>
> Of these, I consider the first the most problematic. If this is the
> main entry point for the project then it suggests that other parts
> of the project - like the code - could be just as unusable.
>
> Yes, this is all inferred. But they are all so easy to fix that
> leaving the site in this poorly usable state brings up its own
> set of organizational implications.
>
> >> - It falls in the category of Grand Projects, and I've found that
> >> those require a lot of time, research, politics and money.
>
> >Well, what you are talking about solving involves high-speed networking,
> >offshore multiplexing, new routing methods, etc., and YES it's expensive,
> >takes time, etc.
>
Well, to conclude my part of this discussion:
if you want an NMR structure, you need at least an NMR spectrometer, or access to one....
I don't see any difference between this and the transit technology you need for moving data.... if you need to process gigs of data you need a fiber in some way... science is costly.
For the record:
at EMBL we are around 1000 people, all using the web and mail etc.
On top of this we run some pretty important databases, and all of this is on an E3 (34 Mbit)... and we are not maxed out at any time... (well...)
> Really? My goal is to allow people like me to continue working
> with bioinformatics data. I set a pretty firm limit that a
> solution must use technology likely available in 5 years and be
> accessible to small research and development groups who can spend
> at *most* about $20,000 per year for access to bioinformatics data.
> The highest speed I considered possible was two bonded T1 lines, for
> 3 Mbit/sec total.
>
> If I can throw money at the problem then getting a T3 line, at
> $240,000/yr, meets all of my data transfer needs for the next
> decade, all with existing technology. (I do not attempt to make
> predictions more than 10 years in the future.)
>
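Just as a rough back-of-envelope on those line speeds (the 10 GB dataset
size below is purely an illustration, not any particular database release):

    # Back-of-envelope transfer times for the line speeds mentioned above.
    # The 10 GB dataset size is only an illustration, not a real release.
    LINE_SPEEDS = {            # bits per second
        "T1": 1.544e6,
        "2 x T1 (bonded)": 3.088e6,
        "T3": 44.736e6,
    }
    DATASET_BITS = 10e9 * 8    # a hypothetical 10 GB flat-file download

    for name, bits_per_sec in LINE_SPEEDS.items():
        hours = DATASET_BITS / bits_per_sec / 3600.0
        print("%-16s %6.1f hours per full download" % (name, hours))

A full refresh that takes roughly seven hours over bonded T1s drops to
about half an hour over a T3; the question is only whether the T3 price
is worth it.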
> On the other hand, if you are correct and it requires high-speed
> networking, etc., then there's no way I can stay in this field as
> an independent company because I just can't afford it.
>
> I happen to think that the existing infrastructure can be
> made much more efficient, in order to allow a much longer use
> of existing networks but also to enable new sorts of
> collaboration and development models. This I can afford to do.
>
> I could be wrong, which is why I proposed everything as a
> set of conjectures, with some ability to test or measure the
> validity of most of them.
>
> >but so was the creation of the www backbone as it stands today
>
> The commercial internet - the reason I am able to be an independent
> company and live where I do - took 20 years to develop. I predict
> a solution is needed before then, so there needs to be a way to
> extend viability until next-generation networking (dense fiber
> multiplexing, all-optical switches, fiber to my house, and all
> those cool things) is common.
>
> >> - I don't think anything they are proposing would work over a
> >> T1 line,
>
> >Of course not; it's a new backbone structure, and I think some of the
> >technology will come from projects like Myrinet, who are deeply
> >involved in cluster networks
>
> I don't have the resources to invest in hardware, only thought.
> Hence for me that is not a project in which I can participate.
>
> >You have to distinguish between accessing a backbone (which moves the big
> >data around) and accessing the data.... the latter is a matter of
> >structuring data and creating good interfaces... the former is a
> >matter of fiber and Ciscos :)
>
> This was part of one of my conjectures -
>
> Conjecture 6.5:
> This includes colocation - data providers will start
> allowing any (sandboxed) code to run on local machines.
>
> specifically
>
> Search results are small. A BLAST search scans the whole
> database and people want just the top (say) 100 matches.
>
> So let people write their database search engines and put them
> on a machine at the database provider's site. Let them run
> it as a web or CORBA server so it can be integrated with the
> rest of the systems back at their site.
>
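By the way, a colocated search service along those lines can be tiny. A
minimal sketch in Python - the blastall command line, the database path
and the port are placeholders, not any existing provider's setup:

    #!/usr/bin/env python
    # Minimal sketch of a colocated search service: the search runs on the
    # data provider's machine, next to the database, and only the small
    # result set travels over the network.  The search command ("blastall")
    # and the database path are placeholders for whatever the provider hosts.

    import subprocess
    from http.server import BaseHTTPRequestHandler, HTTPServer

    TOP_N = 100             # return only the best matches, not the whole scan
    DATABASE = "/data/nr"   # provider-side database; path is an assumption

    class SearchHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            query = self.rfile.read(length)   # FASTA query sent by the client
            # Run the provider's local search engine; a real service would
            # parse the report instead of keeping the first TOP_N lines.
            result = subprocess.run(
                ["blastall", "-p", "blastp", "-d", DATABASE],
                input=query, capture_output=True)
            top_hits = b"\n".join(result.stdout.splitlines()[:TOP_N])
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(top_hits)        # only the summary goes back

    if __name__ == "__main__":
        HTTPServer(("", 8000), SearchHandler).serve_forever()

The sandboxing is the hard part, of course - the sketch above trusts the
submitted code and query completely, which a real provider could not.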
> Sincerely,
>
> Andrew
> dalke@acm.org
>
>
>
> _______________________________________________
> BioPython mailing list - BioPython@biopython.org
> http://biopython.org/mailman/listinfo/biopython
--
Rune Linding                                   linding@gandalf:~$
EMBL - Biocomputing Unit (Gibson Team, v105)   phone  +49 (0)6221 387451
Meyerhofstrasse 1                              fax    +49 (0)6221 387517
D-69117 Heidelberg                             mobile +49 (0)1794 629313
Deutschland                                    home   +49 (0)6221 1371261