[Bioperl-l] Bio::FeatureHolderI

Matthew Pocock matthew_pocock@yahoo.co.uk
Mon, 18 Nov 2002 23:55:59 +0000


Ewan Birney wrote:
> 
> Agreed (on both points). We should definitely do this post 1.2...
> 

I strongly agree with this. As the one who is writing the BioJava query 
optimizer, I caution against adding queries to 1.2 - you will need a 
minimum of 6 months to get it even approximately right if you intend to 
do query planning. There are subtle issues with queries - are you 
matching top-level features only, or are you traversing the feature 
tree? Will you allow user-defined filters? Are you returning a set of 
interesting features, or are you returning cut-down feature hierarchies? 
Do you want to allow an entire XPath-style tree searching language, or 
stick to simply filtering features by their own properties? Do you want 
to be able to filter one feature by the properties of another (e.g. 
repeat that overlaps an exon), or is this out-of-bounds data (e.g. make 
a location from the exon positions, then filter by and(repeat, overlaps 
location))? Are these things intended for people or for computers? Do 
you have any intention of passing these over the wire? Can they be 
applied to any 'FeatureHolderI' or just to sequences, or just to 
features, or perhaps to entire sequence DBs? Do feature holders know 
anything about the filters that would hit their sequences? Will this 
play well with some extended querying capabilities, e.g. could something 
functionally equivalent to 
seqDB.filterSequences(byHasGoTerm(foo)).filterFeatures(exons) be 
expressed as a single query object (graph, hierarchy, whatever)? With 
queries, meta-data is everything. Without it, you end up with 
unmaintainable spaghetti.
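
To make two of these questions concrete (top-level versus tree 
traversal, and filtering a repeat by an exon's location), here is a 
minimal sketch in Python. It is not BioPerl or BioJava code: Feature, 
ByType, Overlaps, And and select are all hypothetical names invented for 
the illustration. The filters are plain data objects, so a program can 
inspect them as well as run them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """Hypothetical feature: a type, a location, and nested sub-features."""
    type: str
    start: int
    end: int
    children: List["Feature"] = field(default_factory=list)


@dataclass(frozen=True)
class ByType:
    type: str

    def accept(self, feature):
        return feature.type == self.type


@dataclass(frozen=True)
class Overlaps:
    start: int
    end: int

    def accept(self, feature):
        # Closed intervals overlap when each starts before the other ends.
        return feature.start <= self.end and self.start <= feature.end


@dataclass(frozen=True)
class And:
    left: object
    right: object

    def accept(self, feature):
        return self.left.accept(feature) and self.right.accept(feature)


def select(features, flt, recurse=False):
    """Return matching features; descend into children only if recurse is set."""
    hits = []
    for feature in features:
        if flt.accept(feature):
            hits.append(feature)
        if recurse:
            hits.extend(select(feature.children, flt, recurse=True))
    return hits


gene = Feature("gene", 1, 1000,
               [Feature("exon", 100, 200), Feature("exon", 400, 500)])
repeat = Feature("repeat", 150, 180)
far_repeat = Feature("repeat", 600, 650)
top_level = [gene, repeat, far_repeat]

# Exons are nested under the gene, so a top-level-only match finds none.
top_level_exons = select(top_level, ByType("exon"))
exons = select(top_level, ByType("exon"), recurse=True)

# "Out-of-bounds" style: take each exon's location, then run
# and(repeat, overlaps location) over the top-level features.
repeats_on_exons = [
    hit
    for exon in exons
    for hit in select(top_level,
                      And(ByType("repeat"), Overlaps(exon.start, exon.end)))
]
```

Note that the exon/repeat relationship lives in the calling code, not in 
the query object, which is exactly the kind of design decision the 
questions above are about.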

Above all, do you really want to do this all yourself, from scratch, for 
every data-type people may be interested in, or do you want to off-load 
this to the new ontology stuff and get someone clever like ChrisM to 
write code to generate the query framework from a proper ontological 
definition of the bioperl objects? Sounds scary, but believe me, you 
will not want to maintain query code for everything people want to query.

So, above all, I would suggest coming up with a page full of queries 
you would like to express (like find exons) and things you think could 
be optimized (like we're in Ensembl, just scan the exon table). Start 
thinking /seriously/ about your meta-data, and formalize this so that 
you have an object model for representing meta-data. Read good books on 
lambda-calculus, Prolog and the dragon compiler book. Learn something 
like Haskell, Lisp and Prolog. Learn SQL & AQL. Decide if the queries are 
based around text (like SQL) or are syntax-trees (like in XQuery). 
Then, and only then, start to code this up.
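
As a rough illustration of the "just scan the exon table" kind of 
optimization, here is a small Python sketch. Everything in it is an 
assumption made for the example: TableStore stands in for a database 
with one table per feature type, and required_type is a hypothetical 
helper that reads meta-data off a syntax-tree filter. It is nowhere near 
a real query planner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByType:
    type: str

    def accept(self, feature):
        return feature["type"] == self.type


@dataclass(frozen=True)
class LongerThan:
    length: int

    def accept(self, feature):
        return feature["end"] - feature["start"] + 1 > self.length


@dataclass(frozen=True)
class And:
    left: object
    right: object

    def accept(self, feature):
        return self.left.accept(feature) and self.right.accept(feature)


def required_type(flt):
    """Return a feature type every match must have, or None if unknown."""
    if isinstance(flt, ByType):
        return flt.type
    if isinstance(flt, And):
        return required_type(flt.left) or required_type(flt.right)
    return None  # opaque filter: nothing can be inferred


class TableStore:
    """Hypothetical store with one table (a list of rows) per feature type."""

    def __init__(self, tables):
        self.tables = tables
        self.rows_scanned = 0  # crude cost counter for the example

    def query(self, flt):
        wanted = required_type(flt)
        if wanted is not None:
            # The filter's meta-data says only one table can contain hits.
            candidates = self.tables.get(wanted, [])
        else:
            # No usable meta-data: fall back to scanning everything.
            candidates = [row for table in self.tables.values()
                          for row in table]
        self.rows_scanned += len(candidates)
        return [row for row in candidates if flt.accept(row)]


store = TableStore({
    "exon": [{"type": "exon", "start": 100, "end": 200},
             {"type": "exon", "start": 400, "end": 410}],
    "repeat": [{"type": "repeat", "start": 150, "end": 180}],
})
long_exons = store.query(And(ByType("exon"), LongerThan(50)))
```

The shortcut is only possible because the filter is a tree the store can 
inspect. A filter with no meta-data (LongerThan on its own, here) forces 
the full scan.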

Or, add a feature method to FeatureHolderI tonight that takes a FilterI 
with accept(feature), and then learn all this afterwards. That's what we 
did.
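
In Python terms, that quick route might look something like the sketch 
below. The names FilterI and accept come from the paragraph above; the 
features method name, the dict-based features and ForwardStrandExons are 
assumptions for the example, not the BioPerl or BioJava API.

```python
from typing import Iterable, List, Protocol


class FilterI(Protocol):
    """Anything with an accept(feature) method counts as a filter."""

    def accept(self, feature: dict) -> bool: ...


class FeatureHolder:
    """Hypothetical holder exposing a single filtering method."""

    def __init__(self, features: Iterable[dict]):
        self._features = list(features)

    def features(self, flt: FilterI) -> List[dict]:
        # No planning at all: ask the filter about every feature in turn.
        return [f for f in self._features if flt.accept(f)]


class ForwardStrandExons:
    """A user-defined filter; the holder cannot see inside it."""

    def accept(self, feature):
        return feature["type"] == "exon" and feature["strand"] == 1


holder = FeatureHolder([
    {"type": "exon", "strand": 1},
    {"type": "exon", "strand": -1},
    {"type": "repeat", "strand": 1},
])
hits = holder.features(ForwardStrandExons())
```

This is cheap to add and easy to use, but the filter is opaque: the 
holder can only call accept, so there is nothing for an optimizer to 
work with until meta-data is added.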

One other thing: whatever query code you end up with, it's likely to 
have a high bus quotient, (no. of people using the code) / (no. of people 
who could be hit by a bus and the code still get maintained). I don't 
really see how that's avoidable. This stuff falls into the same category 
as writing DP code generators & fast matrix math libraries.

Welcome to semantic hell.

Matthew

PS: did I mention that you need meta-data?

-- 
BioJava Consulting LTD - Support and training for BioJava
http://www.biojava.co.uk
