[Biopython-dev] [Wg-phyloinformatics] BioGeography update

Brad Chapman chapmanb at 50mail.com
Sat Jul 4 20:11:00 UTC 2009


Hi Nick;
Thanks much for the update. I'm cc'ing in the Biopython dev list to
keep everyone there in the loop as well.

> I have worked out a number of better functions for searching xml 
> database results, i.e. finding all elements with tags y that exist 
> somewhere inside elements with tags x.  This is much more flexible in 
> the event that data of interest resides at different levels of a 
> hierarchy, which I have found in some cases.
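
For anyone following along on the dev list, here is a minimal sketch of that
kind of nested search using only the standard library's
xml.etree.ElementTree. The function name find_nested and the sample XML are
made up for illustration; this is not the code in Nick's branch.

```python
import xml.etree.ElementTree as ET


def find_nested(root, outer_tag, inner_tag):
    """Return elements tagged inner_tag found at any depth inside
    an element tagged outer_tag (hypothetical helper)."""
    found = []
    seen = set()
    for outer in root.iter(outer_tag):
        for inner in outer.iter(inner_tag):
            # Skip the outer element itself, and anything already
            # collected through an enclosing outer element.
            if inner is outer or id(inner) in seen:
                continue
            seen.add(id(inner))
            found.append(inner)
    return found


# Invented sample: <key> appears at two different depths inside <record>,
# and once outside any <record>.
SAMPLE = """
<results>
  <record><key>1</key></record>
  <record><detail><key>2</key></detail></record>
  <key>3</key>
</results>
"""

root = ET.fromstring(SAMPLE)
keys = [el.text for el in find_nested(root, "record", "key")]
```

Only the keys that sit somewhere inside a record are returned, regardless of
how deeply they are nested.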

Awesome. Echoing what Hilmar mentioned, it would be good to step back
at this point and talk about integration with Biopython. A couple
of thoughts and suggestions along those lines:

- You've included code from lagrange, which worries me for two
  reasons. First, it overlaps with existing Biopython functionality
  in Bio.Nexus; we want to avoid that, as it's confusing for users
  of the package to find different, incompatible implementations.
  If the existing code doesn't work for you in some way, could you
  lay out those issues on the Biopython dev list so we can work to
  resolve them? Second, lagrange is licensed under the GPL, so in
  practice its code can't be included in Biopython, which is
  licensed much more freely.

- You've settled on a flat system of coding with functions and no
  nesting inside of classes. This makes it difficult to tell the
  public API apart from internal functions. We could make this
  clearer in a few ways:

  - Organizing related functionality into classes.
  - Prefixing internal functions with underscores to indicate they
    are not meant to be called by users.
  - Starting to provide some user documentation, ideally centered
    around use cases. Often these help provide a way to think about
    the usability of the code and hint at ways to improve it.
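
As a rough illustration of the first two points, here is a small sketch. The
class name OccurrenceSearch, its methods, and the record layout are all
hypothetical and only show the convention; they are not taken from the
BioGeography code.

```python
class OccurrenceSearch:
    """Hypothetical class grouping related search functionality."""

    def __init__(self, records):
        # records: a list of dicts with "taxon" and "key" entries
        # (an invented layout, just for this example).
        self.records = list(records)

    def keys_for(self, taxon):
        """Public API: return the keys of all records for a taxon."""
        return [self._extract_key(rec)
                for rec in self.records
                if self._matches(rec, taxon)]

    # The leading underscore marks these as internal helpers that
    # users are not expected to call directly.
    def _matches(self, record, taxon):
        return record.get("taxon") == taxon

    def _extract_key(self, record):
        return record["key"]


search = OccurrenceSearch([
    {"taxon": "Puma concolor", "key": 11},
    {"taxon": "Lynx rufus", "key": 12},
    {"taxon": "Puma concolor", "key": 13},
])
puma_keys = search.keys_for("Puma concolor")

# Listing only the names without a leading underscore shows what a
# user would see as the public interface.
public_names = [name for name in dir(OccurrenceSearch)
                if not name.startswith("_")]
```

With this layout a user browsing the class sees a single entry point,
keys_for, and the helpers stay out of the way.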

Hope this is helpful and I'm happy to offer more specific
suggestions as you dig into it. Have a great 4th of July weekend,

Brad


> Stephen Smith wrote:
> > These look really great. Glad the lagrange tree code is working out. I 
> > am very excited for the merging of the Biopython and the lagrange tree 
> > classes. More details to come.
> > Stephen
> > ==================
> > Stephen A. Smith
> > Postdoctoral Researcher
> > NESCent: National Evolutionary Synthesis Center
> > page: http://blackrim.org
> > blog: http://blackrim.net/semaphoront
> > sasmith at nescent.org
> > 
> > 
> > 
> > On Jun 24, 2009, at 12:47 AM, Nick Matzke wrote:
> > 
> >> OK, here's the latest...
> >>
> >> New functions: a bunch of stuff dealing with phylogenetic trees, making
> >> use of the tree/node class in Stephen Smith's lagrange (GNU public
> >> license), which was superior to the half-baked (and not GPL) tree/node
> >> class I was using before GSoC started.
> >>
> >> =============
> >> read_ultrametric_Newick(newickstr):
> >> Read a Newick file into a tree object (a series of node objects linked to
> >> parent and daughter nodes), also reading node ages and node labels if 
> >> any.
> >>
> >> list_leaves(phylo_obj):
> >> Print out all of the leaves above a node object
> >>
> >> treelength(node):
> >> Gets the total branchlength above a given node by recursively adding
> >> through tree.
> >>
> >> phylodistance(node1, node2):
> >> Get the phylogenetic distance (branch length) between two nodes.
> >>
> >> get_distance_matrix(phylo_obj):
> >> Get a matrix of all of the pairwise distances between the tips of a tree.
> >>
> >> get_mrca_array(phylo_obj):
> >> Get a square list of lists (array) listing the mrca of each pair of
> >> leaves (half-diagonal matrix)
> >>
> >> subset_tree(phylo_obj, list_to_keep):
> >> Given a list of tips and a tree, remove all other tips and resulting
> >> redundant nodes to produce a new smaller tree.
> >>
> >> prune_single_desc_nodes(node):
> >> Follow a tree from the bottom up, pruning any nodes with only one 
> >> descendent
> >>     
> >> find_new_root(node):
> >> Search up tree from root and make new root at first divergence
> >>
> >> make_None_list_array(xdim, ydim):
> >> Make a list of lists ("array") with the specified dimensions   
> >>
> >> get_PD_to_mrca(node, mrca, PD):
> >> Add up the phylogenetic distance from a node to the specified ancestor
> >> (mrca).  Find mrca with find_1st_match.
> >>
> >> find_1st_match(list1, list2):
> >> Find the first match in two ordered lists.
> >>
> >> get_ancestors_list(node, anc_list):
> >> Get the list of ancestors of a given node
> >>
> >> addup_PD(node, PD):
> >> Adds the branchlength of the current node to the total PD measure.
> >>     
> >> print_tree_outline_format(phylo_obj):
> >> Prints the tree out in "outline" format (daughter clades are indented, 
> >> etc.)
> >>
> >> print_Node(node, rank):
> >> Prints the node in question, and recursively all daughter nodes,
> >> maintaining rank as it goes.
> >>
> >> lagrange_disclaimer():
> >> Just prints lagrange citation etc. in code using lagrange libraries.
> >> =============
> >>
> >>
> >>
> >> What's next:
> >>
> >> I'm going to spend the rest of this week following up on Brad's
> >> suggestions to make the code more standard, with the priority of
> >> figuring out how I can revise the current BioPython phylogeny class, to
> >> resemble the better version in lagrange, so that there is a generic
> >> flexible phylogeny/newick parser that can be used generally as well as
> >> by my BioGeography package specifically.
> >>
> >> updated wiki/git:
> >> http://biopython.org/wiki/BioGeography#June.2C_week_3:_Functions_to_read_user-specified_Newick_files_.28with_ages_and_internal_node_labels.29_and_generate_basic_summary_information. 
> >>
> >> http://github.com/nmatzke/biopython/commits/Geography
> >>
> >> Cheers!
> >> Nick
> >>
> >>
> >>
> >>
> >>
> >> Nick Matzke wrote:
> >>> Sorry my update is slow, it is coming in a bit!  Thanks, Nick
> >>>
> >>> Brad Chapman wrote:
> >>>> Nick;
> >>>> Thanks for the update -- hope y'all are having fun at the Evolution
> >>>> meeting and have managed to meet up.
> >>>>
> >>>>> Basically this week I added functions to download & parse large
> >>>>> numbers of records, get TaxonOccurrence gbifKeys, and search with
> >>>>> those keys.  Main functions:
> >>>>
> >>>> Good stuff. My main comment echoes a couple of things we discussed
> >>>> earlier:
> >>>>
> >>>> - It is not clear to a user which functions are API functions to
> >>>>  call and which are used internally. Prefixing the internal
> >>>>  functions with underscores (_) and organizing these into classes
> >>>>  will help with this.
> >>>>
> >>>> - I still noticed some tempfile writing from what we discussed last
> >>>>  week. If you have problems using in memory file handles let us
> >>>>  know and we can discuss more.
> >>>>
> >>>> In general if your coding style is to get it out there and then
> >>>> re-factor, that is cool. But please put some time into the
> >>>> schedule for this so I know not to bug you before you've actually
> >>>> had a chance to go through things a second time. Also, it's a good
> >>>> idea to do this in segments as we go along. From experience, if you
> >>>> build up too much code that needs rework it becomes more mentally
> >>>> difficult to get into the rewriting.
> >>>>
> >>>>> An issue:
> >>>>>
> >>>>> Next week come functions to process phylogenetic trees.  I have had
> >>>>> issues with the current BioPython newick parser etc.; basically what
> >>>>> exists appears to not accept node label information which is required
> >>>>> to store e.g. branchlengths which are crucial for the sorts of things
> >>>>> I have to do in the future.  So unless there is a better suggestion I
> >>>>> plan to modify & upload my own tree parsing/using functions.  I
> >>>>> am open to suggestions in this matter.
> >>>>
> >>>> We do not want to introduce duplicated code for Newick tree parsing in
> >>>> Biopython. This is a good opportunity to engage the development list
> >>>> to help figure out how to fix the current parser to do what you
> >>>> need. If you are not sure how to get started, the best way is to get
> >>>> together a small test file that demonstrates your problems, and post
> >>>> it to the list. It would be more useful to everyone to have your
> >>>> fixes in the main parser.
> >>>>
> >>>> Brad
> >>>>
> >>>
> >>
> >> -- 
> >> ====================================================
> >> Nicholas J. Matzke
> >> Ph.D. Candidate, Graduate Student Researcher
> >> Huelsenbeck Lab
> >> Center for Theoretical Evolutionary Genomics
> >> 4151 VLSB (Valley Life Sciences Building)
> >> Department of Integrative Biology
> >> University of California, Berkeley
> >>
> >> Lab websites:
> >> http://ib.berkeley.edu/people/lab_detail.php?lab=54
> >> http://fisher.berkeley.edu/cteg/hlab.html
> >> Dept. personal page:
> >> http://ib.berkeley.edu/people/students/person_detail.php?person=370
> >> Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
> >> Lab phone: 510-643-6299
> >> Dept. fax: 510-643-6264
> >> Cell phone: 510-301-0179
> >> Email: matzke at berkeley.edu
> >>
> >> Mailing address:
> >> Department of Integrative Biology
> >> 3060 VLSB #3140
> >> Berkeley, CA 94720-3140
> >>
> >> -----------------------------------------------------
> >> "[W]hen people thought the earth was flat, they were wrong. When people
> >> thought the earth was spherical, they were wrong. But if you think that
> >> thinking the earth is spherical is just as wrong as thinking the earth
> >> is flat, then your view is wronger than both of them put together."
> >>
> >> Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
> >> 14(1), 35-44. Fall 1989.
> >> http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
> >> ====================================================
> >> _______________________________________________
> >> Wg-phyloinformatics mailing list
> >> Wg-phyloinformatics at nescent.org
> >> https://lists.nescent.org/mailman/listinfo/wg-phyloinformatics
> > 
> > 
> 


