[Bioperl-l] Re: Comparative genomics

Philip Lijnzaad lijnzaad@ebi.ac.uk
Fri, 28 Sep 2001 13:44:57 +0100


>> Is this a tree per gene ? How many species (nodes) per tree? If the trees are
> Initially I was thinking of taking an EnsEMBL family to build the tree
> from, but generating the families across multiple genomes, how does it
> sound?

Already done: For family120, I have dumped all the peptides of EnsHuman,
EnsMouse and SWISSPROT/SPTREMBL (vertebrates) into one big bag, and clustered
them, so the families are mixed (and alignments are coming)

We'll have to think about the best way to use this cross-species data. 

>> essential, do not use the common NODE(ID, PARENT_ID, NAME) type schema: it's
>> only efficient for getting the direct parents or children of a node. If
>> that's not enough (e.g. you want complete subtrees or lineages etc.), use
>> Celko's NODE(LFT, RGT, ID, NAME) schema. Or a hybrid of the two.

> hmmm... think I need to talk to you about this one, I saw something with
> nested brackets, where each bracket opening was a new level of depth in
> the tree, are we talking about the same thing?

No. The idea is that each node has two IDs (called LFT and RGT), which bracket
all the LFTs and RGTs of its subtree. It looks like this (quoting a frequent
news:comp.databases.theory posting by Joe Celko):

 CREATE TABLE Personnel
 (emp CHAR(10) PRIMARY KEY,
  boss CHAR(10),       -- this column is unneeded
  salary DECIMAL(6,2) NOT NULL,
  lft INTEGER NOT NULL,
  rgt INTEGER NOT NULL);

 Personnel
 emp      boss      salary  lft  rgt
 ===================================
 Albert   NULL     1000.00   1   12
 Bert     Albert    900.00   2    3
 Chuck    Albert    900.00   4   11
 Donna    Chuck     800.00   5    6
 Eddie    Chuck     700.00   7    8
 Fred     Chuck     600.00   9   10

 which would look like this as a directed graph:

            Albert (1,12)
            /        \
          /            \
    Bert (2,3)    Chuck (4,11)
                   /    |   \
                 /      |     \
               /        |       \
             /          |         \
        Donna (5,6)  Eddie (7,8)  Fred (9,10)


So if you have node 'Chuck', its sub-tree is completely implied by the
LFT,RGT pair, and can be retrieved in one SQL statement (OK, in this case, it's
just three children, but never mind). 
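
As a rough sketch of what that one statement could look like (not quoted from
Celko; the helper names subtree and lineage are made up for illustration, and
SQLite is used only so the example is self-contained), the table above can be
loaded and queried like this. A subtree is every row whose LFT falls inside the
node's (LFT, RGT) interval; a lineage is every row whose interval contains the
node's LFT.

```python
import sqlite3

# In-memory database holding the Personnel example from this message.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Personnel
 (emp CHAR(10) PRIMARY KEY,
  boss CHAR(10),
  salary DECIMAL(6,2) NOT NULL,
  lft INTEGER NOT NULL,
  rgt INTEGER NOT NULL);

INSERT INTO Personnel VALUES
 ('Albert', NULL,     1000.00, 1, 12),
 ('Bert',   'Albert',  900.00, 2,  3),
 ('Chuck',  'Albert',  900.00, 4, 11),
 ('Donna',  'Chuck',   800.00, 5,  6),
 ('Eddie',  'Chuck',   700.00, 7,  8),
 ('Fred',   'Chuck',   600.00, 9, 10);
""")


def subtree(conn, emp):
    """Return emp and everything below it (hypothetical helper)."""
    rows = conn.execute(
        """SELECT child.emp
             FROM Personnel AS parent, Personnel AS child
            WHERE parent.emp = ?
              AND child.lft BETWEEN parent.lft AND parent.rgt
            ORDER BY child.lft""",
        (emp,),
    )
    return [r[0] for r in rows]


def lineage(conn, emp):
    """Return the path from the root down to emp (hypothetical helper)."""
    rows = conn.execute(
        """SELECT anc.emp
             FROM Personnel AS anc, Personnel AS node
            WHERE node.emp = ?
              AND node.lft BETWEEN anc.lft AND anc.rgt
            ORDER BY anc.lft""",
        (emp,),
    )
    return [r[0] for r in rows]


print(subtree(conn, "Chuck"))   # ['Chuck', 'Donna', 'Eddie', 'Fred']
print(lineage(conn, "Eddie"))   # ['Albert', 'Chuck', 'Eddie']
```

Note that neither query touches the boss column, and neither needs recursion
or repeated round trips, however deep the tree is.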

If you want more info, let me know. Details are in Celko's SQL for Smarties
(Morgan Kaufmann, preferably 2nd ed.)

(btw I advise not to use his update statements, they are unusable). 
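
One possible way to avoid in-place updates altogether, assuming the boss column
is kept (i.e. the hybrid schema mentioned above), is to recompute all LFT/RGT
values from the parent pointers with a depth-first walk after the tree changes.
This is only a sketch under that assumption, not Celko's method: the function
name renumber is made up, siblings are ordered alphabetically for want of a
better rule, and the recursion would need replacing for very deep trees.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Personnel
 (emp CHAR(10) PRIMARY KEY,
  boss CHAR(10),
  lft INTEGER NOT NULL DEFAULT 0,
  rgt INTEGER NOT NULL DEFAULT 0);

-- Only the parent pointers are filled in; lft/rgt start out as 0.
INSERT INTO Personnel (emp, boss) VALUES
 ('Albert', NULL), ('Bert', 'Albert'), ('Chuck', 'Albert'),
 ('Donna', 'Chuck'), ('Eddie', 'Chuck'), ('Fred', 'Chuck');
""")


def renumber(conn):
    """Recompute lft/rgt for every row from the boss column (hypothetical helper)."""
    # Map each boss to the list of direct reports, in alphabetical order.
    children = {}
    for emp, boss in conn.execute("SELECT emp, boss FROM Personnel ORDER BY emp"):
        children.setdefault(boss, []).append(emp)

    counter = 0

    def visit(emp):
        nonlocal counter
        counter += 1
        lft = counter                 # number handed out on the way down
        for child in children.get(emp, []):
            visit(child)
        counter += 1                  # number handed out on the way back up
        conn.execute(
            "UPDATE Personnel SET lft = ?, rgt = ? WHERE emp = ?",
            (lft, counter, emp),
        )

    for root in children.get(None, []):   # rows whose boss is NULL
        visit(root)
    conn.commit()


renumber(conn)
for row in conn.execute("SELECT emp, lft, rgt FROM Personnel ORDER BY lft"):
    print(row)
```

For this example the walk reproduces the numbering in the table above. The
trade-off is that every change costs a full renumbering, which is fine for
trees that are read far more often than they are modified.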

                                                                      Philip
-- 
The mail transport agent is not liable for any coffee stains in this message
-----------------------------------------------------------------------------
Philip Lijnzaad, lijnzaad@ebi.ac.uk \ European Bioinformatics Institute,rm A2-08
+44 (0)1223 49 4639                 / Wellcome Trust Genome Campus, Hinxton
+44 (0)1223 49 4468 (fax)           \ Cambridgeshire CB10 1SD,  GREAT BRITAIN