[Open-bio-l] bioentry_direct_links transitive?
Hilmar Lapp
hlapp@gnf.org
Wed, 1 May 2002 12:00:08 -0700
In essence that table establishes an association between bioentries, right? Do you treat such associations as transitive, and if so, do you reflect this in the db by multiplying the associations? I.e., if entry A has a cross-ref to entry B, and B has one to C, do you assume that A is in fact cross-referencing C, too, and if so, do you physically add this association?
-hilmar
--
-------------------------------------------------------------
Hilmar Lapp email: lapp@gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------
> -----Original Message-----
> From: Chris Mungall [mailto:cjm@bdgp.lbl.gov]
> Sent: Wednesday, May 01, 2002 9:45 AM
> To: Hilmar Lapp
> Cc: GMOD Devel (E-mail); OBDA BioSQL (E-mail)
> Subject: RE: [GMOD-devel] Re: [Open-bio-l] Schema for genes &
> features&mappings to assemblies
>
>
>
> On Tue, 30 Apr 2002, Hilmar Lapp wrote:
>
> > I like this proposal, but ...
> >
> > > 9. project-centric column names like "chromosome" are avoided; eg
> > > drosophila has chromosome arms as top level sequences
>
> really this is just a column naming issue, it doesn't affect
> the structure
> of the schema at all.
>
> > So, I'm still confused about how I am supposed to store gene
> > predictions, EST, RefSeq, or whatever mappings to chromosomes in an
> > assembly, such that I can answer queries like 'show me all exons of
> > genes and their lines of evidence that map between markers
> X and Y on
> > chromosome 5 of mouse. Next, show how human genes map to
> this region,
> > and which human chromosomes.'
>
> Ignoring assemblies for a second, genomic alignments of EST or RefSeq
> sequences should use the (not yet present) featurepair table,
> (or whatever
> we decide to name this table).
>
> The exons would be stored as seqfeatures, as should the
> markers (presuming
> you've mapped them to base pair coordinates)
>
> assuming flat features i.e. no-assembly, you just select all
> seqfeatures
> and featurepairs between X and Y, which will give you mouse genes and
> associated evidence.
>
> Amongst these feature pairs will be hits to RefSeq human.
> Let's say you're
> storing both mouse and human in the same db, and you've run blast or
> whatever on both mouse & human chromosomes vs RefSeq. You can
> do a query
> joining bioentry to taxon (human only), and 2 joins of
> featurepair to get
> the human sequences that match to both human and mouse.
>
> It's a bit of a pain going through bioentry all the time, there is an
> argument for collapsing bioentry and biosequence into the
> same table, it
> would reduce a lot of joins.
>
> That's one way; there are multiple ways to do it depending on how your
> project works. Maybe you store RefSeq IDs as dbxrefs linked to your
> seqfeature gene entries. There are a number of ways of determining the
> 'how human genes map to this region' part of your example query (e.g.
> large scale synteny blocks, blastp results on the peptide
> level, curated
> homologies...), I think all of them are possible in BioSQL in
> one way or
> another.
>
> Now if we throw assemblies into the equation, it gets a
> little trickier.
>
> With the solution I outlined in an earlier email, you can do the exact
> same query as you would above, using views to dynamically project your
> features-on-contigs to features-on-chromosomes.
>
> This could be slow, and it wouldn't be possible in MySQL.
>
> However, if you were prepared to do all your data mining type
> queries on a
> warehouse db recreated nightly, you could still use MySQL,
> you just need
> to materialise the views. This is effectively what i do with GadFly
> (MySQL) - the production db has features mapped to assembly units, the
> public/warehouse copy gets flattened to chromosome arm
> coordinates, and
> the adapter code is given a hint not to bother checking
> assembly units.
>
> If you want to use MySQL *and* you want to use assemblies
> *and* you want
> to do your query on a production database *then* you're forced to
> either write really complex SQL *or* do it via adapters, in
> which case you
> lose the ability to specify complex queries declaratively.
>
> Somewhere is this multidimensional matrix of use-cases and DBMSs and
> langauges there's cells that are unsatisfactory or just impossible, I
> don't know whether BioSQL should be aiming to fill in as much of this
> matrix as possible or just to focus on certain areas.
>
> > Maybe someone can help me lifting my confusion.
> >
> > How is this done in GMOD and Ensembl, and how does that map
> to BioSQL
> > with the assembly proposal below?
> >
> > -hilmar
> >
>
>
>
>