[BioRuby] gsoc questions
Christian M Zmasek
czmasek at burnham.org
Wed Jun 9 21:50:16 EDT 2010
Hi Sara:
>
>
> On Mon, Jun 7, 2010 at 11:20 AM, Sara Rayburn <sararayburn at gmail.com
> <mailto:sararayburn at gmail.com>> wrote:
>
> Hi Christian and Diana,
>
> Two questions:
>
> 1) On the phylosoft website for forester/sdi
> (http://www.phylosoft.org/forester/applications/sdi_r/) I've read
> this about the two trees:
> "The important point to keep in mind is that there must be at least
> one sub-element of the 'taxonomy' element which allows to match the
> sequences in the gene tree with a taxonomy in the species tree. In
> this example this sub-element of the 'taxonomy' element is 'code'."
>
> Does this mean that the sub-element for matching will *always* be
> 'code'? Or should I just be looking for anything at all that
> matches? Also, will all phyloxml trees have the 'code' sub-element?
>
>
> To find out whether some element will always contain some other element
> you can look at PhyloXML documentation [0]. For example at the Taxonomy
> element documentation [1] you can see that it has a sub-element "code"
> which is [0..1], which means that there either is no "code" sub-element
> or there is one and no more, whereas there could none or many "synonym"
> sub-elements
>
> [0] http://www.phyloxml.org/documentation/version_1.10/phyloxml.xsd.html
> [1] http://www.phyloxml.org/documentation/version_1.10/phyloxml.xsd.html#h888650454
Good point! Matching the taxonomic information is crucial.
I recommend implementing it in the same manner as the
"isEqual" method of the org.forester.phylogeny.data.Taxonomy class
of the forester library; see:
http://forester-atv.cvs.sourceforge.net/viewvc/forester-atv/forester-atv/java/src/org/forester/phylogeny/data/Taxonomy.java?revision=1.57&view=markup
In this (Java) class the matching works like this:
1. If both Taxonomies being compared have identifiers with the
same source (e.g. NCBI taxonomy), use these identifiers to match.
In Java:
if ( ( getIdentifier() != null ) && ( tax.getIdentifier() != null ) )
{
    return getIdentifier().isEqual( tax.getIdentifier() );
}
2. Otherwise, if both Taxonomies have taxonomy codes, use the taxonomy
codes to match.
In Java:
else if ( !ForesterUtil.isEmpty( getTaxonomyCode() ) &&
          !ForesterUtil.isEmpty( tax.getTaxonomyCode() ) )
{
    return getTaxonomyCode().equals( tax.getTaxonomyCode() );
}
3. Otherwise, if both Taxonomies have scientific names, use the
scientific names to match.
4. Otherwise, if both Taxonomies have common names, use the common names
to match.
5. Otherwise, matching is not possible and an error should be thrown.
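Put together, the five steps above could be sketched roughly as follows.
This is only a simplified, self-contained illustration and not the forester
code: the class TaxonomyMatchSketch, the Tax holder and its field names are
hypothetical, and both the exact (case-sensitive) string comparison and the
way an identifier is split into source and value are assumptions.

```java
import java.util.Objects;

// Hypothetical, simplified sketch; this is not the forester Taxonomy class.
final class TaxonomyMatchSketch {

    // Minimal taxonomy holder; any field may be null or empty.
    static final class Tax {
        final String idProvider;      // identifier source, e.g. "ncbi"
        final String idValue;         // identifier value, e.g. "9606"
        final String code;            // taxonomy code, e.g. "HUMAN"
        final String scientificName;
        final String commonName;

        Tax(String idProvider, String idValue, String code,
            String scientificName, String commonName) {
            this.idProvider = idProvider;
            this.idValue = idValue;
            this.code = code;
            this.scientificName = scientificName;
            this.commonName = commonName;
        }
    }

    static boolean isEmpty(String s) {
        return s == null || s.isEmpty();
    }

    // Compares on the first kind of field that BOTH taxonomies provide.
    static boolean isEqual(Tax a, Tax b) {
        // 1. Identifiers (assumption: source and value must both agree).
        if (!isEmpty(a.idValue) && !isEmpty(b.idValue)) {
            return Objects.equals(a.idProvider, b.idProvider)
                && a.idValue.equals(b.idValue);
        }
        // 2. Taxonomy codes.
        if (!isEmpty(a.code) && !isEmpty(b.code)) {
            return a.code.equals(b.code);
        }
        // 3. Scientific names.
        if (!isEmpty(a.scientificName) && !isEmpty(b.scientificName)) {
            return a.scientificName.equals(b.scientificName);
        }
        // 4. Common names.
        if (!isEmpty(a.commonName) && !isEmpty(b.commonName)) {
            return a.commonName.equals(b.commonName);
        }
        // 5. Nothing in common to compare on.
        throw new IllegalArgumentException(
            "taxonomies share no comparable field");
    }

    public static void main(String[] args) {
        Tax inGeneTree = new Tax(null, null, "HUMAN", "Homo sapiens", null);
        Tax inSpeciesTree = new Tax("ncbi", "9606", "HUMAN", null, "human");
        // Only one side has an identifier, so the codes decide: prints "true".
        System.out.println(isEqual(inGeneTree, inSpeciesTree));
    }
}
```

Note that the order matters: once both taxonomies have a given kind of
field, that field alone decides the result, even if a lower-priority field
would have matched.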
Generally speaking, I recommend getting the source code of forester and
looking at the classes in the org.forester.sdi directory (especially
SDI.java, SDIse.java, and SDIR.java).
>
> 2) Here's my assumptions about the final output of the algorithm:
> Each node in the tree should be updated with speciation OR
> duplication, and the tree as a whole has a count of
> speciation/duplication events. Am I on the right track here?
Yes, the primary goal of the algorithm is to calculate for each node in
the gene tree whether it is a duplication or a speciation, and thus each
node should be annotated as duplication or speciation.
Keeping track of the sum of duplications and speciations is useful too,
but cannot, as far as I know, be stored in the tree object itself.
Maybe the algorithm could return a small "SDI_result" object which is
used to store such "summary" information.
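To illustrate what I mean by such a result object, here is a minimal
sketch. All names (SdiResultSketch, SdiResult, Event, summarize) are
hypothetical and nothing here is taken from forester or BioRuby; it also
assumes that the per-node annotations are already available as a simple
list.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a small summary object for an SDI run.
final class SdiResultSketch {

    // The two possible annotations for an internal gene tree node.
    enum Event { DUPLICATION, SPECIATION }

    // Immutable holder for the tree-wide counts.
    static final class SdiResult {
        final int duplications;
        final int speciations;

        SdiResult(int duplications, int speciations) {
            this.duplications = duplications;
            this.speciations = speciations;
        }
    }

    // Counts the events assigned to the nodes (one list entry per node).
    static SdiResult summarize(List<Event> nodeEvents) {
        int duplications = 0;
        int speciations = 0;
        for (Event e : nodeEvents) {
            if (e == Event.DUPLICATION) {
                duplications++;
            } else {
                speciations++;
            }
        }
        return new SdiResult(duplications, speciations);
    }

    public static void main(String[] args) {
        SdiResult r = summarize(Arrays.asList(
            Event.DUPLICATION, Event.SPECIATION, Event.SPECIATION));
        System.out.println(r.duplications + " duplications, "
            + r.speciations + " speciations");
    }
}
```

The per-node annotations would still live in the tree; the result object
only carries the summary that has no natural place in the tree itself.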
Christian