[BioRuby] gsoc update
Christian M Zmasek
czmasek at burnham.org
Tue Jun 22 18:29:09 UTC 2010
Hi, Sara:
Hopefully you and your son are fully recovered now!
To me, Bio::Algorithm::SDI would make the most sense.
Re: "It seems that forester has the assumption built in that any node in
a tree that has a child must have two children. Is this a property of
phylogenetic trees?"
Being composed entirely of binary nodes is indeed a property of trees
produced by most programs for phylogenetic inference. In contrast, if
multiple (binary) trees are used to calculate a consensus tree (e.g.
bootstrap resampling), then the resulting consensus tree might contain
nodes with more than two children (depending on the method of consensus
tree calculation and the degree of divergence among the resampled
trees). Furthermore, if (phylogenetic or taxonomic) trees are "manually"
created (or by various "supertree" approaches), nodes with more than two
children are oftentimes used to express uncertainty.
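
As a rough illustration of checking that assumption before running a
binary-only algorithm, here is a minimal sketch. TreeNode and
non_binary_nodes are hypothetical names made up for this example; they
are not part of BioRuby or forester.

```ruby
# Hypothetical minimal tree node (an assumption for illustration only).
TreeNode = Struct.new(:name, :children)

# Collects every internal node whose number of children is not exactly two.
def non_binary_nodes(node, found = [])
  kids = node.children || []
  found << node unless kids.empty? || kids.size == 2
  kids.each { |kid| non_binary_nodes(kid, found) }
  found
end

# A consensus-style tree with a three-way split at the root.
leaves = %w[A B C].map { |n| TreeNode.new(n, []) }
polytomy = TreeNode.new("root", leaves)
puts non_binary_nodes(polytomy).map(&:name).inspect
```

An empty result means the tree is strictly binary; anything else lists
the nodes a binary-only implementation would have to reject or handle.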
For the purpose of gene duplication inference, it would be particularly
useful to allow non-binary species trees (expressing uncertainty about
the tree-of-life and preventing the introduction of spurious duplications).
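
To make the "spurious duplication" point concrete, here is a small
sketch of the usual binary rule: map each gene tree node to the least
common ancestor of its children's species, and call a duplication when
a node maps to the same species node as one of its children. The Node
class, lca, and map_and_label are hypothetical names for this example
only; this is not the forester or BioRuby code, and it does not attempt
the non-binary case.

```ruby
# Hypothetical minimal node used for both trees (illustration only).
class Node
  attr_reader :name, :species, :children
  attr_accessor :parent, :mapped, :duplication

  def initialize(name, children = [], species: nil)
    @name = name
    @species = species
    @children = children
    children.each { |child| child.parent = self }
  end

  def leaf?
    children.empty?
  end

  def depth
    parent.nil? ? 0 : parent.depth + 1
  end
end

# Least common ancestor of two species tree nodes, via parent pointers.
def lca(a, b)
  a = a.parent while a.depth > b.depth
  b = b.parent while b.depth > a.depth
  until a.equal?(b)
    a = a.parent
    b = b.parent
  end
  a
end

# Post-order pass over a binary gene tree: sets :mapped on every node and
# :duplication on every internal node.
def map_and_label(gene_node, species_leaf_by_name)
  if gene_node.leaf?
    gene_node.mapped = species_leaf_by_name.fetch(gene_node.species)
    return
  end
  gene_node.children.each { |child| map_and_label(child, species_leaf_by_name) }
  left, right = gene_node.children
  gene_node.mapped = lca(left.mapped, right.mapped)
  gene_node.duplication =
    gene_node.mapped.equal?(left.mapped) || gene_node.mapped.equal?(right.mapped)
end

# Species tree ((A,B),C); imagine the (A,B) grouping is an arbitrary guess.
sp_a, sp_b, sp_c = %w[A B C].map { |n| Node.new(n) }
Node.new("root", [Node.new("AB", [sp_a, sp_b]), sp_c])
species_leaves = { "A" => sp_a, "B" => sp_b, "C" => sp_c }

# Gene tree (a1,(b1,c1)), which disagrees with that grouping.
inner = Node.new("g_bc", [Node.new("b1", species: "B"), Node.new("c1", species: "C")])
gene_root = Node.new("g_root", [Node.new("a1", species: "A"), inner])

map_and_label(gene_root, species_leaves)
puts "root duplication?  #{gene_root.duplication}"
puts "inner duplication? #{inner.duplication}"
```

Here the gene tree root is labelled a duplication only because the gene
tree disagrees with the (A,B) grouping. If that grouping merely
resolves an uncertain three-way split, the call is an artifact of the
forced binary resolution.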
Re: "For the non-binary case, should I go forward planning to implement
the algorithm from the Vernot et al. paper or should I be planning to
extend your algorithm?"
You should plan on working on the SDI algorithm and 'modify' it so that
it correctly works on non-binary species trees.
Now, this is easier said than done.
A while ago, I developed such an algorithm and implemented it as
org.forester.sdi.GSDI (for Generalized SDI). You can look at it in file
org/forester/sdi/GSDI.
Yet, the big issue is that while this algorithm seems to work, I don't
have a mathematical proof for its correctness.
In any case, I recommend doing the following:
1. Thoroughly test (and write unit tests for) your current implementation
of binary SDI. For example, does it correctly use the different
sub-elements of taxonomy for matching? That is: Does it work if both the
species and gene trees use scientific names for taxonomic identification?
Does it work if both use NCBI identifiers for taxonomic identification?
Does it work if both use NCBI identifiers but also have non-matching
common names (in this case it should use the identifiers and ignore the
common names)? Will it throw an exception if no matching sub-elements of
taxonomy are present?
2. Perform timing benchmarks. Does it behave similarly to (although
overall slower than) the Java implementation (see Figure 4 in Zmasek and
Eddy, 2001)? Oftentimes, an unexpected timing benchmark result is an
indication of an underlying problem.
3. I will look at your implementation as well.
4. Look at org.forester.sdi.GSDI and see if you can understand it and
test it on paper. If this makes sense to you then we can go ahead and
plan on implementing this within BioRuby.
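
For the taxonomy matching cases in point 1, here is a rough sketch of
the behaviour the tests would need to pin down. The Taxonomy struct,
MATCH_PRIORITY, shared_field, and same_taxon? are hypothetical names for
this example, not the BioRuby or forester API. Ranking the identifier
above the scientific name is an assumption; the only ordering stated
above is that identifiers win over common names.

```ruby
# Hypothetical stand-in for a taxonomy record (illustration only).
Taxonomy = Struct.new(:id, :scientific_name, :common_name, keyword_init: true)

# Assumed priority: identifier, then scientific name, then common name.
MATCH_PRIORITY = %i[id scientific_name common_name].freeze

# Returns the highest-priority sub-element present in both records,
# or raises if the two records share no sub-element at all.
def shared_field(a, b)
  field = MATCH_PRIORITY.find { |f| a[f] && b[f] }
  raise ArgumentError, "no taxonomy sub-element present in both records" if field.nil?
  field
end

# Compares two records on their shared highest-priority sub-element only.
def same_taxon?(a, b)
  field = shared_field(a, b)
  a[field] == b[field]
end

species = Taxonomy.new(id: "9606", common_name: "human")
gene    = Taxonomy.new(id: "9606", common_name: "man")
puts same_taxon?(species, gene) # identifiers agree, common names are ignored
```

Each question in point 1 then becomes one small test case: matching by
scientific name only, matching by identifier only, matching identifiers
with conflicting common names, and an exception when nothing overlaps.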
Christian
Sara Rayburn wrote:
> Hi,
>
> Well, as far as I can tell, things are looking much, much better. I'm sorry I got a bit behind, but my son and I have been sick this past week.
>
> For the namespace/file locations, the response from the mailing list has been:
> Bio::SpeciationDuplicationInference, Bio::Evolution::SDI,
> Bio::Algorithm::SD with the files in lib/bio/util/Phylogeny/SDI, or lib/bio/phylo/sdi.rb and Bio::Phylo::SDI
>
> What do you guys think?
>
> Also, when I've been in doubt I've looked at the java implementation. It seems that forester has the assumption built in that any node in a tree that has a child must have two children. Is this a property of phylogenetic trees?
>
> Other than tying up a couple of loose ends, I think the binary case is pretty much wrapped up. Please let me know if there are things I need to modify or rethink.
>
> For the non-binary case, should I go forward planning to implement the algorithm from the Vernot et al. paper or should I be planning to extend your algorithm?
>
> Thanks and again, sorry for getting a bit behind.
>
> Sara