[Biopython-dev] New Newick parser in Bio.Phylo

Ben Morris ben at bendmorris.com
Mon Feb 11 04:04:45 UTC 2013


On Sun, Feb 10, 2013 at 10:30 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> On Sun, Feb 10, 2013 at 9:39 PM, Ben Morris <ben at bendmorris.com> wrote:
>>
>> On Sun, Feb 10, 2013 at 9:11 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>> > Hi Ben,
>> >
>> > I've noticed a couple new characteristics of the Newick parser that I had
>> > questions about.
>> >
>> > 1. There is no longer a way to tell the parser to treat internal node labels
>> > as confidence values. Lots of files in the wild do record the support values
>> > here, including those generated by RAxML, PhyML, FastTree and MrBayes, so
>> > I'd like to restore this option, and perhaps make it the default. I think
>> > the condition is:
>> >
>> > if not (self.values_are_confidence or self.comments_are_confidence or
>> > current_clade.is_terminal()): # parse confidence from node label
>> >
>> > Is there an easy way to add this option to the parser? I'm trying to get
>> > this to work in the "else" clause in parse_tree, where unquoted node labels
>> > are handled.
>> >
>> >
>> > 2. Confidence values are required to be between 0.0 and 1.0. Also, support
>> > values recorded as integers are treated as percentages and divided by 100
>> > automatically. The phyloXML spec doesn't have this range requirement. RAxML
>> > scales bootstraps to 100, but PhyML records the raw number of supporting
>> > bootstrap runs (e.g. supports out of 1000 if there were 1000 bootstrap
>> > replicates). So, I'd prefer to leave the confidence values as they are,
>> > requiring only that they be numeric. Thoughts?
>> >
>> >
>> > Thanks,
>> > Eric
>>
>> 1. One issue is that current_clade.is_terminal() will always be true
>> at that point because current_clade's children haven't been parsed
>> yet. Putting the check in the "process_clade" function (which is
>> called when the closing paren is hit, and therefore all children
>> should have been parsed) should fix this.
>>
>> So, if values_are_confidence and comments_are_confidence are both
>> false and a node label is numeric, it should be treated as confidence,
>> and clade.name should be set to None - is that correct?
>>
>> 2. This should be as simple as removing current lines 123-127.
>>
>> ~Ben
>
>
>
> Thanks. Here's #2:
> https://github.com/biopython/biopython/commit/0aee549e72fe5dcf9bcea239d29780706500922a
>
> I agree with your assessment of #1, but haven't been able to get it working
> yet. I'm leaving Bug #3407 open for now:
> https://redmine.open-bio.org/issues/3407
>

I think this should do it:

https://github.com/bendmorris/biopython/commit/b430f27ff908f07d8ab59bec48429947f0028d63

I also updated the test case to make sure this is working correctly,
and changed the default value of comments_are_confidence from True to
False.
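
For anyone following the thread, the rule agreed on in #1 can be sketched
roughly as below. This is only an illustration under stated assumptions, not
the code in the linked commit: the Clade class and the label_to_confidence
function are hypothetical stand-ins, and the check is assumed to run once the
closing paren has been reached, so that is_terminal() reflects the parsed
children. The numeric value is kept as-is, with no rescaling, per #2.

```python
class Clade:
    """Hypothetical stand-in for a parsed tree node (not Bio.Phylo's class)."""

    def __init__(self, name=None, children=None):
        self.name = name
        self.children = children or []
        self.confidence = None

    def is_terminal(self):
        # A clade with no children is a leaf.
        return not self.children


def label_to_confidence(clade, values_are_confidence=False,
                        comments_are_confidence=False):
    """Move a numeric internal-node label into clade.confidence.

    Assumed to be called only after all of the clade's children are attached;
    calling it earlier would make every clade look terminal.
    """
    if values_are_confidence or comments_are_confidence or clade.is_terminal():
        return clade  # confidence comes from elsewhere, or this is a leaf
    if clade.name is None:
        return clade  # nothing to interpret
    try:
        value = float(clade.name)
    except ValueError:
        return clade  # non-numeric label: keep it as the clade name
    clade.confidence = value  # left unscaled: 95 stays 95.0, 870 stays 870.0
    clade.name = None
    return clade


# An internal node labelled "95" becomes confidence 95.0 with no name.
inner = label_to_confidence(Clade("95", [Clade("A"), Clade("B")]))
print(inner.name, inner.confidence)  # prints: None 95.0

# A leaf labelled "42" keeps its name, since leaves are never reinterpreted.
leaf = label_to_confidence(Clade("42"))
print(leaf.name, leaf.confidence)  # prints: 42 None
```

With either flag set to True the label is left alone, on the assumption that
the confidence is then read from the branch value or the comment instead.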

If that looks correct, feel free to pull.

~Ben


