[BioPython] large data distribution (was: versioning, distribution and webdav)
Rune Linding - EMBL
linding@EMBL-Heidelberg.DE
Thu, 17 May 2001 12:53:24 +0200
On Thu, May 17, 2001 at 04:07:56AM -0600, Andrew Dalke wrote:
> BTW, let me add as well the good points about it, which I
> was negligent in doing in my first response.
>
> - It is meant for large systems well beyond any problems
> we face now or will face in the next ten years. (When I mentioned to a friend
> of mine, a nuclear physicist at Michigan State, the data sizes
> we deal with, he pointed out his data sets were 3 orders of
> magnitude larger.)
>
> - It combines work from different fields, which allows for
> a cross-pollination not possible from a single field.
>
> - It has more resources behind it (9.8 billion euro) than anything
> I could imagine. Perhaps I need more expensive tastes!
>
> - It has a conceptual framework for how such systems might
> be designed, and people with experience in doing so.
>
>
> Then to respond to Rune Linding's follow-up to my comments:
> >> - I really didn't like the word 'grid' when I heard it a few
> >> years back as GRiD. But then, I didn't like "www" so I'm
> >> not the best judge of this.
> >
> >I think this is irrelevant
>
> Yep. Completely agree. Names by themselves don't mean anything.
>
> But the word "grid", when used in this context, has specific meanings
> related to its use in related national and international projects,
> and to its connotations of the "power grid", including a history going
> back to Multics, which was the first project I heard of where
> people talked about the metaphor of plugging into a distributed
> network to share resources. That project and its metaphor are well
> known in the textbooks.
>
> If a project accepts that word then I feel it must address those
> relationships. For example, in some sense Multics failed because
> it never produced a really commercially viable machine. On the
> other hand, it succeeded wildly as a research topic, and Unix
> is one of its descendants, if only as a reaction against Multics.
>
> So if the DataGrid is specifically alluding to that history,
> then I would urge caution for anyone who plans to depend on its
> results within the next five years, which is what I see as the
> required timeline. On the other hand, promoting it as a research
> platform is a different topic entirely. I want that first.
>
I still don't find it relevant :)
> >> - their current site is poorly designed
>
> >I'm sorry, but I don't see how using Flash or placing a 4 MB PDF document
> >is a problem.
>
> I made that statement for several reasons. The first was purely
> an observation, based on knowing nothing else about the site at this point.
> By making it I caution others that the site may be hard to visit.
> In addition, I hope to encourage others to treat good web design as
> important, if only to avoid comments like mine.
>
> In a larger sense, it suggests several views I consider negative.
> - usability is of relatively low concern
I don't have a problem extracting the info I want from their site.
>
> - they have no problem using proprietary technologies (Flash)
> Plus, they are using it for a navigation bar! Without it the
> nav bar is broken.
The hardware in front of you is proprietary technology..... but I'm not a Flash supporter :)
And I didn't give you this link to discuss open-source issues... really.
>
> - they expect people to be connected to the web through large pipes
> (a 4MB PDF with no caution as to its size)
I totally disagree, 4 megs is nothing :)
>
> Of these, I consider the first the most problematic. If this is the
> main entry point for the project then it suggests that other parts
> of the project - like the code - could be just as unusable.
>
> Yes, this is all inferred. But they are all so easy to fix that
> leaving the site in this poorly usable state brings up its own
> set of organizational implications.
>
> >> - It falls in the category of Grand Projects, and I've found that
> >> those require a lot of time, research, politics and money.
>
> >Well, what you are talking about solving involves high-speed networking,
> >offshore multiplexing, new routing methods, etc., and YES it's expensive,
> >takes time, etc.
>
Well, to conclude my part of this discussion:
if you want an NMR structure, you need at least an NMR spectrometer, or access to one....
I don't see any difference between this and the transit technology you need for moving data.... if you need to process gigs of data you need a fiber in some way... science is costly.
For the record:
at EMBL we are around 1000 people, all using the web and mail etc.
On top of this we run some pretty important databases, and all of this is on an E3 (34 Mbit)... and we are not maxed out at any time... (well...)
> Really? My goal is to allow people like me to continue working
> with bioinformatics data. I set a pretty firm limit that a
> solution must use technology likely available in 5 years and be
> accessible to small research and development groups who can spend
> at *most* about $20,000 per year for access to bioinformatics data.
> The highest speed I considered possible was two bonded T1 lines, for
> 3 Mbit/sec total.
>
> If I can throw money at the problem then getting a T3 line, at
> $240,000/yr, meets all of my data transfer needs for the next
> decade, all with existing technology. (I do not attempt to make
> predictions more than 10 years in the future.)
>
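Just as a rough back-of-envelope on those line speeds (the 10 GB dataset
size below is purely an illustration, not any particular database release):

    # Back-of-envelope transfer times for the line speeds mentioned above.
    # The 10 GB dataset size is only an illustration, not a real release.
    LINE_SPEEDS = {            # bits per second
        "T1": 1.544e6,
        "2 x T1 (bonded)": 3.088e6,
        "T3": 44.736e6,
    }
    DATASET_BITS = 10e9 * 8    # a hypothetical 10 GB flat-file download

    for name, bits_per_sec in LINE_SPEEDS.items():
        hours = DATASET_BITS / bits_per_sec / 3600.0
        print("%-16s %6.1f hours per full download" % (name, hours))

A full refresh that takes roughly seven hours over bonded T1s drops to
about half an hour over a T3; the question is only whether the T3 price
is worth it.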
> On the other hand, if you are correct and it requires high-speed
> networking, etc., then there's no way I can stay in this field as
> an independent company because I just can't afford it.
>
> I happen to think that the existing infrastructure can be
> made much more efficient, in order to allow a much longer use
> of existing networks but also to enable new sorts of
> collaboration and development models. This I can afford to do.
>
> I could be wrong, which is why I proposed everything as a
> set of conjectures, with some ability to test or measure the
> validity of most of them.
>
> >but so was the creation of the www backbone as it stands today
>
> The commercial internet - the reason I am able to be an independent
> company and live where I do - took 20 years to develop. I predict
> a solution is needed before then, so there needs to be a way to
> extend viability until next-generation networking (dense fiber
> multiplexing, all-optical switches, fiber to my house, and all
> those cool things) is common.
>
> >> - I don't think anything they are proposing would work over a
> >> T1 line,
>
> >Of course not; it's a new backbone structure, and I think some of the
> >technology will come from projects like Myrinet, who are deeply
> >involved in cluster networks
>
> I don't have the resources to invest in hardware, only thought.
> Hence for me that is not a project in which I can participate.
>
> >You have to distinguish between accessing a backbone (which moves the big
> >data around) and accessing the data.... the latter is a matter of
> >structuring data and creating good interfaces... the former is a
> >matter of fiber and Ciscos :)
>
> This was part of one of my conjectures -
>
> Conjecture 6.5:
> This includes colocation - data providers will start
> allowing any (sandboxed) code to run on local machines.
>
> specifically
>
> Search results are small. A BLAST search scans the whole
> database and people want just the top (say) 100 matches.
>
> So let people write their database search engines and put them
> on a machine at the database provider's site. Let them run
> it as a web or CORBA server so it can be integrated with the
> rest of the systems back at their site.
>
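By the way, a colocated search service along those lines can be tiny. A
minimal sketch in Python - the blastall command line, the database path
and the port are placeholders, not any existing provider's setup:

    #!/usr/bin/env python
    # Minimal sketch of a colocated search service: the search runs on the
    # data provider's machine, next to the database, and only the small
    # result set travels over the network.  The search command ("blastall")
    # and the database path are placeholders for whatever the provider hosts.

    import subprocess
    from http.server import BaseHTTPRequestHandler, HTTPServer

    TOP_N = 100             # return only the best matches, not the whole scan
    DATABASE = "/data/nr"   # provider-side database; path is an assumption

    class SearchHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            query = self.rfile.read(length)   # FASTA query sent by the client
            # Run the provider's local search engine; a real service would
            # parse the report instead of keeping the first TOP_N lines.
            result = subprocess.run(
                ["blastall", "-p", "blastp", "-d", DATABASE],
                input=query, capture_output=True)
            top_hits = b"\n".join(result.stdout.splitlines()[:TOP_N])
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(top_hits)        # only the summary goes back

    if __name__ == "__main__":
        HTTPServer(("", 8000), SearchHandler).serve_forever()

The sandboxing is the hard part, of course - the sketch above trusts the
submitted code and query completely, which a real provider could not.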
> Sincerely,
>
> Andrew
> dalke@acm.org
>
>
>
> _______________________________________________
> BioPython mailing list - BioPython@biopython.org
> http://biopython.org/mailman/listinfo/biopython
--
Rune Linding                                   linding@gandalf:~$
EMBL - Biocomputing Unit (Gibson Team, v105)   phone  +49 (0)6221 387451
Meyerhofstrasse 1                              fax    +49 (0)6221 387517
D-69117 Heidelberg                             mobile +49 (0)1794 629313
Deutschland                                    home   +49 (0)6221 1371261