[BioSQL-l] Exploring data in database

Wed Mar 26 13:13:12 EST 2003

I've been through this before. For what it's worth, here's what I 
experienced and learned.

	- Lab/Bench scientists will not learn SQL in order to type in queries. 
(In fact I fully agree with them that that's a total waste of their 
time and skills.)

	- They will not build queries against a fully normalized database. 
(Again, I fully agree this is wasting their time.)

	- They will want to run simple queries first, but very quickly will 
want to drill down and run complex and unpredictable queries. Writing 
tools to support hard-coded queries is not going to cut it, and writing 
tools to flexibly and visually build complex queries is a reinvention 
of a well-invented (though commercial) wheel, and IMHO hence a total 
waste of the Bioinformaticists' time.

	- These days most bench scientists are at least moderately 
computer-savvy. Given willingness and curiosity for their data, I 
haven't encountered lab scientists who could not be taught to build 
queries visually in MS Access on 1 table or joining 2 tables. It also 
didn't take them long to even run aggregate queries.

So, for what it's worth, here's what I would do.

1) Scrap MySQL, use PostgreSQL or any other serious RDBMS that honors 
the fact that there are (normalized) models and (denormalized) views.

2) Design a proper relational data model (that is nice to you as the 
data maintainer, not nice to the scientists as data viewers or miners), 
then write views on top of it that most closely resemble the data types 
in which your lab scientists think.

3) Hand out plain descriptions of your views (just name and column 
names) and ask your scientists whether it makes sense to them. If not, 
you need to improve or rewrite the views until it does. The lab 
scientists have actually a very good idea what data they're working 
with. You just need to match it. Your views will be denormalized. 
Provide one view for each data entity (not relational entity), and 
provide views that pre-join commonly joined datatypes.

4) Set up MS Access and the ODBC driver for your RDBMS on their 
desktops. I suppose the former is done already, and the second is 
usually not very hard.

5) Expose only the views, not the tables, to be visible when they link 
tables in Access. (Provided the RDBMS allows you to do this. Otherwise 
prefix all views, and advise to look only at 'tables' that start with 
the prefix. The objective is to avoid upfront confusion as much as 
possible, which leads to much quicker acceptance.)

6) Watch out for performance problems. Since you wrote the views, you 
are in good control of performance tuning, which is a nice side effect. 
If the view just doesn't cut it performance-wise, materialize it (which 
will be invisible to the users, and won't harm the relational model).

I did this strategy before and it worked out very well in a relatively 
conservative environment of people. What is key is that you tie in one 
or two of the lab scientists early, first to get the views right, and 
second to get a spear head on their end who are going to pull over the 
less enthusiastic. Usually the one or those who devised the experiments 
are most keen to assume this responsibility (as it gives them early 
access to their data).

My gut feeling is that the EnsMart approach works best only if a) your 
user community is huge (because then the strategy previously described 
may quickly require too much support resources), and b) you can define 
a set of canned ways to look at the data and retrieve results, and c) 
you  know that your definitions are alright and are going to cut it 
because those have been the main ways people looked at the data for the 
last year.

My $0.02.

	-hilmar

On Wednesday, March 26, 2003, at 06:55  AM, Kris Boulez wrote:

> [ I know this question is rather off topic, but I don't know of a more
> appropriate mailing list. In the end it's about a database with
> sequences in, so it's not completely OT :)    ]
>
>
> We (will) have a database that tracks the results of different
> experiments in the lab (you could call it a very mini LIMS).
> It will most probably be implemented in MySQL.
>
> The users (scientists) want to look at the data in a very 'exploratory'
> way (e.g. "given this gene, coming from experiment Y, which other genes
> have a similar pattern in experiment X and ... "). They want to query
> the data but have no predefined queries whatsoever.
>
> To allow these people to do this, my standard answer is to learn SQL,
> have a good look at the database schema and write their queries
> themself. They seem however very reluctant to do this.
>
> I can think of alternatives:
>
> - use MS Access and access the database via ODBC (their seems to be a
>   relatively intuitive way of creating queries/reports in Access)
>
> - dump the tables "tab seperated" and look at them in MS Excell.
>
> - implement something like EnsMart
>
> I would however prefer a (web) environment, which lets them explore the
> data. I looked around a bit for this, but couldn't really find 
> anything.
>
>
> Suggestions, comments ?
>
>
> Kris,
> -- 
> Kris Boulez 				Tel: +32-9-241.11.00
> AlgoNomics NV 				Fax: +32-9-241.11.02
> Technologiepark 4 			email: kris.boulez at algonomics.com
> B 9052 Zwijnaarde 			http://www.algonomics.com/
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at open-bio.org
> http://open-bio.org/mailman/listinfo/biosql-l
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------