[BioRuby] gsoc questions

Wed Jun 9 21:50:16 EDT 2010

Hi Sara:
> 
> 
> On Mon, Jun 7, 2010 at 11:20 AM, Sara Rayburn <sararayburn at gmail.com> wrote:
> 
>     Hi Christian and Diana,
> 
>     Two questions: 
> 
>     1) On the phylosoft website for forester/sdi
>     (http://www.phylosoft.org/forester/applications/sdi_r/) I've read
>     this about the two trees: 
>     "The important point to keep in mind is that there must be at least
>     one sub-element of the 'taxonomy' element which allows to match the
>     sequences in the gene tree with a taxonomy in the species tree. In
>     this example this sub-element of the 'taxonomy' element is 'code'."
> 
>     Does this mean that the sub-element for matching will *always* be
>     'code'? Or should I just be looking for anything at all that
>     matches? Also, will all phyloxml trees have the 'code' sub-element?
> 
> 
> To find out whether an element always contains some other element,
> look at the PhyloXML documentation [0]. For example, the Taxonomy
> element documentation [1] shows that the "code" sub-element has
> cardinality [0..1]: there is either no "code" sub-element or exactly
> one. By contrast, there can be zero or many "synonym" sub-elements.
> 
> [0] http://www.phyloxml.org/documentation/version_1.10/phyloxml.xsd.html
> [1] http://www.phyloxml.org/documentation/version_1.10/phyloxml.xsd.html#h888650454

Good point! Matching the taxonomic information is crucial here.
I recommend implementing it in the same manner as the "isEqual" method
of the org.forester.phylogeny.data.Taxonomy class of the forester
library; see:
http://forester-atv.cvs.sourceforge.net/viewvc/forester-atv/forester-atv/java/src/org/forester/phylogeny/data/Taxonomy.java?revision=1.57&view=markup

In this (Java) class the matching works like this:

1. If both Taxonomies being compared have identifiers with the same
source (e.g. NCBI taxonomy), use these identifiers to match.

  In Java:
   if ( ( getIdentifier() != null ) && ( tax.getIdentifier() != null ) )
   {
     return getIdentifier().isEqual( tax.getIdentifier() );
   }

2. Otherwise, if both Taxonomies have taxonomy codes, use the taxonomy
codes to match.

  In Java:
   else if ( !ForesterUtil.isEmpty( getTaxonomyCode() ) &&
             !ForesterUtil.isEmpty( tax.getTaxonomyCode() ) )
   {
     return getTaxonomyCode().equals( tax.getTaxonomyCode() );
   }

3. Otherwise, if both Taxonomies have scientific names, use the 
scientific names to match.

4. Otherwise, if both Taxonomies have common names, use the common names 
to match.

5. Otherwise, matching is not possible and an error should be thrown.
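
To make the whole cascade concrete, here is a minimal, self-contained
sketch of the five steps above. It is not the forester code: the class
SimpleTaxonomy, its fields, and the method name "matches" are made up
for illustration, and the fall-through to step 2 when the identifier
sources differ is my reading of step 1, not something checked against
Taxonomy.java.

```java
import java.util.Objects;

// Hypothetical stand-in for a taxonomy record; this is NOT forester's Taxonomy class.
final class SimpleTaxonomy {
    final String idSource;        // identifier source, e.g. "ncbi" (may be null)
    final String idValue;         // identifier value, e.g. "9606" (may be null)
    final String code;            // taxonomy code, e.g. "HUMAN" (may be null)
    final String scientificName;  // may be null
    final String commonName;      // may be null

    SimpleTaxonomy(String idSource, String idValue, String code,
                   String scientificName, String commonName) {
        this.idSource = idSource;
        this.idValue = idValue;
        this.code = code;
        this.scientificName = scientificName;
        this.commonName = commonName;
    }

    private static boolean present(String s) {
        return s != null && !s.isEmpty();
    }

    // Walks the five steps in order and compares the first field that both sides provide.
    boolean matches(SimpleTaxonomy other) {
        // 1. Identifiers, used only when both come from the same source (assumption).
        if (present(idValue) && present(other.idValue)
                && Objects.equals(idSource, other.idSource)) {
            return idValue.equals(other.idValue);
        }
        // 2. Taxonomy codes.
        if (present(code) && present(other.code)) {
            return code.equals(other.code);
        }
        // 3. Scientific names.
        if (present(scientificName) && present(other.scientificName)) {
            return scientificName.equals(other.scientificName);
        }
        // 4. Common names.
        if (present(commonName) && present(other.commonName)) {
            return commonName.equals(other.commonName);
        }
        // 5. Nothing in common to compare: report an error.
        throw new IllegalArgumentException("no common taxonomy field to compare");
    }
}
```

With this sketch, two taxonomies that share the NCBI identifier "9606"
match even if their codes differ, because step 1 takes precedence over
step 2. A pair where one side has only a common name and the other only
a code raises the exception from step 5.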

Generally speaking, I recommend getting the source code of forester and
looking at the classes in the org.forester.sdi directory (especially
SDI.java, SDIse.java, and SDIR.java).

> 
>     2) Here are my assumptions about the final output of the algorithm:
>     Each node in the tree should be updated with speciation OR
>     duplication, and the tree as a whole has a count of
>     speciation/duplication events. Am I on the right track here?

Yes, the primary goal of the algorithm is to calculate for each node in 
the gene tree whether it is a duplication or a speciation, and thus each 
node should be annotated as duplication or speciation.
Keeping track of the total number of duplications and speciations is
useful too, but it cannot, as far as I know, be stored in the tree
object itself.
Maybe the algorithm could return a small "SDI_result" object which is 
used to store such "summary" information.
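
As a rough illustration of that idea, here is a small sketch of such a
result object. Everything in it is hypothetical (the names EventType,
GeneTreeNode, and SdiResult, and the "summarize" method); it does not
run the SDI algorithm itself. It only assumes that the nodes have
already been annotated and shows how the summary counts could be
collected separately from the tree.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical annotation for a gene tree node.
enum EventType { SPECIATION, DUPLICATION }

// Hypothetical minimal tree node; a real tree class would carry much more.
final class GeneTreeNode {
    final String name;
    final List<GeneTreeNode> children = new ArrayList<>();
    EventType event; // stays null for nodes that carry no annotation (e.g. leaves)

    GeneTreeNode(String name) {
        this.name = name;
    }
}

// Small summary object returned alongside the annotated tree.
final class SdiResult {
    final int duplications;
    final int speciations;

    SdiResult(int duplications, int speciations) {
        this.duplications = duplications;
        this.speciations = speciations;
    }

    // Counts the annotations of all nodes reachable from root (iterative traversal).
    static SdiResult summarize(GeneTreeNode root) {
        int dups = 0;
        int specs = 0;
        Deque<GeneTreeNode> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            GeneTreeNode node = stack.pop();
            if (node.event == EventType.DUPLICATION) {
                dups++;
            } else if (node.event == EventType.SPECIATION) {
                specs++;
            }
            for (GeneTreeNode child : node.children) {
                stack.push(child);
            }
        }
        return new SdiResult(dups, specs);
    }
}
```

The per-node annotation stays on the tree, while the totals live in the
separate result object, so the tree class does not need any new fields
for summary data.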

Christian