[BioRuby] gsoc update

Tue Jun 22 18:29:09 UTC 2010

Hi, Sara:

Hopefully you and your son are fully recovered now!

To me, Bio::Algorithm::SDI would make the most sense.

Re: "It seems that forester has the assumption built in that any node in 
a tree that has a child must have two children. Is this a property of 
phylogenetic trees?"

Being composed entirely of binary nodes is indeed a property of trees 
produced by most programs for phylogenetic inference. In contrast, if 
multiple (binary) trees are used to calculate a consensus tree (e.g. 
from bootstrap resampling), then the resulting consensus tree might contain 
nodes with more than two children (depending on the method of consensus 
tree calculation and the degree of divergence among the resampled 
trees). Furthermore, if (phylogenetic or taxonomic) trees are "manually" 
created (or by various "supertree" approaches), nodes with more than two 
children are oftentimes used to express uncertainty.

For the purpose of gene duplication inference, it would be particularly 
useful to allow non-binary species trees (expressing uncertainty about 
the tree-of-life and preventing the introduction of spurious duplications).
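
As a rough illustration (this is neither forester nor BioRuby code), a 
check for nodes with more than two children could look like the Ruby 
sketch below. The Node struct and the non_binary_nodes helper are 
hypothetical names made up for this example, and the small species tree 
is invented.

```ruby
# Hypothetical minimal tree node; this is NOT the Bio::Tree API.
Node = Struct.new(:name, :children) do
  def leaf?
    children.empty?
  end
end

# Collect every internal node whose child count is not exactly two.
def non_binary_nodes(node, found = [])
  found << node unless node.leaf? || node.children.size == 2
  node.children.each { |child| non_binary_nodes(child, found) }
  found
end

# Invented species tree with one unresolved node: "Metazoa" has three children.
human   = Node.new("human", [])
mouse   = Node.new("mouse", [])
fly     = Node.new("fly", [])
worm    = Node.new("worm", [])
yeast   = Node.new("yeast", [])
mammals = Node.new("Mammalia", [human, mouse])
metazoa = Node.new("Metazoa", [mammals, fly, worm])
root    = Node.new("root", [metazoa, yeast])

puts non_binary_nodes(root).map(&:name).inspect
```

A binary-only SDI implementation could run a check like this on the 
species tree up front and refuse to proceed if the list is not empty.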

Re: "For the non-binary case, should I go forward planning to implement 
the algorithm from the Vernot et al. paper or should I be planning to 
extend your algorithm?"

You should plan on working on the SDI algorithm and 'modify' it so that 
it works correctly on non-binary species trees. Now, this is easier said 
than done. A while ago, I developed such an algorithm and implemented it 
as org.forester.sdi.GSDI (for Generalized SDI). You can look at it in file 
org/forester/sdi/GSDI.
Yet, the big issue is that while this algorithm seems to work, I don't 
have a mathematical proof of its correctness.

In any case, I recommend doing the following:
1. Thoroughly test (and write unit tests for) your current implementation 
of binary SDI. For example, does it correctly use the different 
sub-elements of taxonomy for matching? That is, does it work if both 
species and gene use scientific names for taxonomic identification? Does 
it work if both species and gene use NCBI identifiers for taxonomic 
identification? Does it work if both species and gene use NCBI 
identifiers for taxonomic identification but also have non-matching 
common names (in this case it should use the identifiers and ignore the 
common names)? Will it throw an exception if no matching sub-elements of 
taxonomy are present?
2. Perform timing benchmarks. Does it behave similarly to (although 
overall slower than) the Java implementation (see Figure 4 in Zmasek and 
Eddy, 2001)? Oftentimes, an unexpected timing benchmark result is an 
indication of an underlying problem.
3. I will look at your implementation as well.
4. Look at org.forester.sdi.GSDI and see if you can understand it and 
test it on paper. If this makes sense to you, then we can go ahead and 
plan to implement it within BioRuby.
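
To make the taxonomy-matching questions in point 1 a bit more concrete, 
here is a minimal Ruby sketch of the behaviour I have in mind. It is not 
the forester or BioRuby code: the Taxonomy struct, the MATCH_FIELDS 
constant and the matching_field method are hypothetical names, and 
ranking scientific names between identifiers and common names is my 
assumption for the example. It only decides which sub-element to match 
on; it does not compare the actual values.

```ruby
# Hypothetical stand-in for a taxonomy element; this is NOT the BioRuby API.
Taxonomy = Struct.new(:id, :scientific_name, :common_name, keyword_init: true)

# Assumed order of preference: identifier first, common name last.
MATCH_FIELDS = [:id, :scientific_name, :common_name].freeze

# Return the first sub-element that every taxonomy in both trees has set.
# Raise if the two trees share no usable sub-element at all.
def matching_field(species_taxa, gene_taxa)
  all = species_taxa + gene_taxa
  field = MATCH_FIELDS.find { |f| all.all? { |t| !t[f].nil? } }
  raise ArgumentError, "no shared taxonomy sub-element" if field.nil?
  field
end

# Identifiers are present on both sides, so the non-matching common names
# are ignored.
species = [Taxonomy.new(id: "9606", common_name: "human")]
genes   = [Taxonomy.new(id: "9606", common_name: "man")]
puts matching_field(species, genes)
```

Each question in point 1 then maps onto one unit test: one input per 
combination of sub-elements, plus one that expects the exception.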

Christian

Sara Rayburn wrote:
> Hi,
> 
> Well, as far as I can tell, things are looking much, much better.  I'm sorry I got a bit behind, but my son and I have been sick this past week. 
> 
> For the namespace/file locations, the response from the mailing list has been:
> Bio::SpeciationDuplicationInference, Bio::Evolution::SDI,
> Bio::Algorithm::SD with the files in lib/bio/util/Phylogeny/SDI, or lib/bio/phylo/sdi.rb and Bio::Phylo::SDI
> 
> What do you guys think?
> 
> Also, when I've been in doubt I've looked at the java implementation. It seems that forester has the assumption built in that any node in a tree that has a child must have two children. Is this a property of phylogenetic trees?
> 
> Other than tying up a couple of loose ends, I think the binary case is pretty much wrapped up. Please let me know if there are things I need to modify or rethink.
> 
> For the non-binary case, should I go forward planning to implement the algorithm from the Vernot et al. paper or should I be planning to extend your algorithm? 
> 
> Thanks and again, sorry for getting a bit behind.
> 
> Sara