[Biopython-dev] BioGeography update/BioPython tree module discussion

Tue Jul 21 12:22:13 UTC 2009

Hi Nick;

> 1. Bug fix on Nexus.Tree class is working well so far.  Thanks Brad!!

Sweet. Glad to hear it.

> 2. Code refactoring: this is basically the layout I've got going at the 
> moment.  (long outline & function descriptions below)

Is this checked in on GitHub? I pulled from the Geography
branch but didn't get the new code. The organization below looks
great and really helps with clarity. One additional suggestion I
would make is to prefix functions which are not part of the public
API with an underscore (_internal_function). Just from the
descriptions, I imagine some of the functions like
xml_burrow_up_cousin would not be called directly by users.
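
As a rough sketch of the underscore convention (the class body, the
simplified signatures, the _burrow_down helper and the XML snippet are
all made up for illustration, not taken from the Geography branch):

```python
import xml.etree.ElementTree as ET


class GbifXml(object):
    """Simplified stand-in for the GbifXml class in the outline."""

    def __init__(self, xmltree=None):
        self.xmltree = xmltree

    def find_1st_matching_element(self, el_tag):
        """Public entry point: first element with tag el_tag, or None."""
        return self._burrow_down(self.xmltree.getroot(), el_tag)

    def _burrow_down(self, element, el_tag):
        """Internal recursive helper; the underscore marks it as private."""
        if element.tag == el_tag:
            return element
        for child in element:
            found = self._burrow_down(child, el_tag)
            if found is not None:
                return found
        return None


# Made-up XML, not a real GBIF response.
tree = ET.ElementTree(ET.fromstring("<a><b><c id='1'/></b><c id='2'/></a>"))
gbif = GbifXml(tree)
first_c = gbif.find_1st_matching_element("c")

# Names without a leading underscore are what users would see as the API.
public_names = [n for n in dir(GbifXml) if not n.startswith("_")]
```

The underscore is only a convention; Python does not stop anyone from
calling _burrow_down, but it signals that the helper may change.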

> 3. GbifXml is working, my next task is the TreeSum class which requires 
> re-doing the functions which made use of the lagrange tree class.  I've 
> built these functions under several different tree classes since January 
> and have gotten pretty good at tree logic so this shouldn't be too hard.

Great. Have you had a look at Eric's generic Tree proposal, which he
was working on this week:

http://github.com/etal/biopython/tree/phyloxml/Bio/Tree

It would be great to propose general functionality there so it can
be rolled into PhyloXML and ultimately Nexus parsing as well.

> 4. Philosophy question: If I build some functions that do something new 
> with an e.g. ElementTree (XML tree) object, should I:
> 
> (a) make these functions go in a subclass of the class for the original 
> object (thus inheriting the methods of the original class, and basically 
> adding new methods).  E.g. basically extending the methods of 
> ElementTree, with a subclass GbifElementTree; or:
> 
> (b) make a class containing the object as an attribute, with e.g. 
> GbifXml.xmltree containing an ElementTree attribute which then gets 
> passed to the various functions.
> 
> I currently have (b), but the more I think about it, the more (a) makes 
> sense from a simplicity/usability/maintainability standpoint.

My vote would be for your (b) option. ElementTree is a pretty tricky
interface with overrides for attribute access, so inheriting from it
could be more trouble than it's worth. If you find yourself
mirroring ElementTree functionality, you could always make the tree
itself a public attribute and encourage users to call it directly.
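
Something along these lines is what I have in mind for (b). This is
only a sketch: the occurrence_elements method and the sample XML are
invented for the example, and real GBIF responses will look different
(namespaces, deeper nesting).

```python
import xml.etree.ElementTree as ET


class GbifXml(object):
    """Wraps an ElementTree instead of inheriting from it (option b)."""

    def __init__(self, xmltree=None):
        # Public attribute: users can call ElementTree methods on it directly.
        self.xmltree = xmltree

    def occurrence_elements(self):
        """GBIF-specific helper layered on top of the wrapped tree."""
        return list(self.xmltree.iter("TaxonOccurrence"))


# Made-up sample data, not real GBIF output.
SAMPLE = (
    "<results>"
    "<TaxonOccurrence gbifKey='1'/>"
    "<TaxonOccurrence gbifKey='2'/>"
    "</results>"
)

records = GbifXml(ET.ElementTree(ET.fromstring(SAMPLE)))
keys = [el.get("gbifKey") for el in records.occurrence_elements()]

# Plain ElementTree functionality stays available through the attribute,
# so nothing needs to be mirrored in GbifXml itself.
root_tag = records.xmltree.getroot().tag
```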

Brad

> 
> Cheers!
> Nick
> 
> ==========
> Class for accessing GBIF, downloading records, processing them, and 
> extracting information from the xmltree in that class.
> 
> class GbifXmlError(Exception): pass
> class GbifXml():
>    gbifxml is a class for holding and processing xmltrees of GBIF records.
> 
>    def __init__(self, xmltree=None):
> 
>      This is the constructor for setting up new objects of this class.
> 
>    def print_xmltree(self):
> 
>      Prints all the elements & subelements of the xmltree to screen (may
>      require fix_ASCII to input file to succeed)
> 
>    def print_subelements(self, element):
> 
>      Takes an element from an XML tree and prints the subelements' tag &
>      text, and the within-tag items (key/value or whatnot)
> 
> 
>    def element_items_to_dictionary(self, element_items):
> 
>      If the XML tree element has items encoded in the tag, e.g. key/value or
>      whatever, this function puts them in a python dictionary and returns
>      them.
> 
> 
> 
>    def extract_latlongs(self, element):
> 
>      Create a temporary pseudofile, extract lat longs to it,
>      return results as string.
> 
>      Inspired by: http://www.skymind.com/~ocrow/python_string/
>      (Method 5: Write to a pseudo file)
> 
> 
>    def extract_latlong_datum(self, element, file_str):
> 
>      Searches an element in an XML tree for lat/long information, and the
>      complete name. Searches recursively, if there are subelements.
> 
> 
> 
>    def extract_taxonconceptkeys_tofile(self, element, outfh):
> 
>      Searches an element in an XML tree for TaxonOccurrence gbifKeys,
>      and the complete name. Searches recursively, if there are
>      subelements. Returns file at outfh.
> 
> 
> 
> 
>    def extract_taxonconceptkeys_tolist(self, element, output_list):
> 
>      Searches an element in an XML tree for TaxonOccurrence gbifKeys, 
> and the complete name. Searches recursively, if there are subelements. 
> Returns list.
> 
> 
> 
> 
> 
>    def extract_occurrence_elements(self, element, output_list):
> 
>      Returns a list of the elements, picking elements by 
> TaxonOccurrence; this should
>      return a list of elements equal to the number of hits.
> 
> 
> 
> 
>    def find_to_elements_w_ancs(self, el_tag, anc_el_tag):
> 
>      Burrow into XML to get elements with tag el_tag; return only those
>      underneath a particular ancestor element with tag anc_el_tag
> 
> 
>    def create_sub_xmltree(self, element):
> 
>      Create a subset xmltree (to avoid going back to irrelevant parents)
> 
> 
> 
>    def xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag, 
> match_el_list):
> 
>      Recursively burrows down to find whatever elements with el_tag
>      exist inside an ancestor element with tag anc_el_tag.
> 
> 
>    def xml_burrow_up(self, element, anc_el_tag, found_anc):
> 
>      Burrow up xml to find anc_el_tag
> 
> 
> 
>    def xml_burrow_up_cousin(element, cousin_el_tag, found_cousin):
> 
>      Burrow up from element of interest, until a cousin is found with 
> cousin_el_tag
> 
> 
> 
>    def return_parent_in_xmltree(self, child_to_search_for):
> 
>      Search through an xmltree to get the parent of child_to_search_for
> 
> 
> 
>    def return_parent_in_element(self, potential_parent, 
> child_to_search_for, returned_parent):
> 
>      Search through an XML element to return parent of child_to_search_for
> 
> 
> 
>    def find_1st_matching_element(self, element, el_tag, return_element):
> 
>      Burrow down into the XML tree, retrieve the first element with the 
> matching tag
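
A side note on the parent-finding functions above (return_parent_in_xmltree
and friends): ElementTree elements do not store a pointer to their parent,
so one common approach is to build a child-to-parent dictionary once and
look parents up in it. A minimal sketch, with hypothetical function names
and made-up XML:

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """Map every child element to its parent by walking the tree once."""
    return {child: parent for parent in root.iter() for child in parent}


def return_parent(root, child_to_search_for):
    """Return the parent of child_to_search_for, or None for the root."""
    return build_parent_map(root).get(child_to_search_for)


# Made-up XML for illustration only.
root = ET.fromstring("<a><b><c/></b><d/></a>")
c = root.find("b/c")
parent_of_c = return_parent(root, c)
```

If many lookups are needed, building the map once and reusing it avoids
re-walking the tree for every query.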
> 
> 
> 
> 
> # Functions devoted to accessing/downloading GBIF records
> 
> def access_gbif(url, params):
> 
>    # Helper function to access various GBIF services
>    #
>    # choose the URL ("url") from here:
>    # http://data.gbif.org/ws/rest/occurrence
>    #
>    # params are a dictionary of key/value pairs
>    #
>    # "_open" is from Bio.Entrez._open, online here:
>    # http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#_open
>    #
>    # Get the handle of results
>    # (looks like e.g.: <addinfourl at 75575128 whose fp = 
> <socket._fileobject object at 0x48117f0>> )
> 
>    # (open with results_handle.read() )
> 
> 
> def get_hits(params):
> 
>    Get the actual hits that are returned by a given search
>    (this allows parsing & gradual downloading of searches larger
>    than e.g. 1000 records)
>    It will return the LAST non-none instance (in a standard search 
> result there
>    should be only one, anyway).
> 
> 
> def get_xml_hits(params):
> 
>    Returns hits like get_hits, but returns a parsed XML tree.
> 
> 
> def get_all_records_by_increment(params, inc, prefix_fn):
> 
>    Download all of the records in stages, store in list of elements.
>    Increments of e.g. 100 to not overload server
> 
> def get_record(key):
> 
>    Get a single record, return xmltree for it.
> 
> 
> def get_numhits(params):
> 
>    Get the number of hits that will be returned by a given search
>    (this allows parsing & gradual downloading of searches larger
>    than e.g. 1000 records)
>    It will return the LAST non-none instance (in a standard search 
> result there
>    should be only one, anyway).
> 
> def extract_numhits(element):
> 
>    # Search an element of a parsed XML string and find the
>    # number of hits, if it exists.  Recursively searches,
>    # if there are subelements.
>    #
> 
> def xmlstring_to_xmltree(xmlstring):
> 
>    Take the text string returned by GBIF and parse to an XML tree using 
> ElementTree.
>    Requires the intermediate step of saving to a temporary file 
> (required to make
>    ElementTree.parse work, apparently)
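
On the temporary file: ElementTree.parse wants a filename or file-like
object, but the module also has fromstring, which parses a string
directly, so the intermediate file may not be needed. A sketch, assuming
the GBIF response is already in memory as a string (the sample XML and
the _via_handle variant are made up for illustration):

```python
import io
import xml.etree.ElementTree as ET


def xmlstring_to_xmltree(xmlstring):
    """Parse an XML string straight into an ElementTree, no temp file."""
    root = ET.fromstring(xmlstring)
    return ET.ElementTree(root)


def xmlstring_to_xmltree_via_handle(xmlstring):
    """Alternative: wrap the string in an in-memory handle for parse()."""
    return ET.parse(io.StringIO(xmlstring))


# Made-up response text, not real GBIF output.
xmlstring = "<response><summary totalMatched='2'/></response>"
tree = xmlstring_to_xmltree(xmlstring)
tree2 = xmlstring_to_xmltree_via_handle(xmlstring)
```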
> 
> 
> 
> 
> class TreeSum():
> 
>    Summary statistics on trees (some of these are now redundant with 
> Nexus.Tree & will be eliminated).
> 
>    def read_ultrametric_Newick(newickstr):
> 
>      Read a Newick file into a tree object (a series of node objects 
> linked to parent and daughter nodes), also reading node ages and node 
> labels if any.
> 
> 
>    def list_leaves(phylo_obj):
> 
>      Print out all of the leaves above a node object
> 
> 
> 
>    def treelength(node):
> 
>      Gets the total branchlength above a given node by recursively 
> adding through tree.
> 
> 
>    def phylodistance(node1, node2):
> 
>      Get the phylogenetic distance (branch length) between two nodes.
> 
> 
>    def get_distance_matrix(phylo_obj):
> 
>      Get a matrix of all of the pairwise distances between the tips of a 
> tree.
> 
> 
> 
>    def get_mrca_array(phylo_obj):
> 
>      Get a square list of lists (array) listing the mrca of each pair of 
> leaves
>      (half-diagonal matrix)
> 
> 
> 
>    def subset_tree(phylo_obj, list_to_keep):
> 
>      Given a list of tips and a tree, remove all other tips and 
> resulting redundant nodes to produce a new smaller tree.
> 
> 
>    def prune_single_desc_nodes(node):
> 
>      Follow a tree from the bottom up, pruning any nodes with only one 
> descendant
> 
> 
>    def find_new_root(node):
> 
>      Search up tree from root and make new root at first divergence
> 
> 
>    def make_None_list_array(xdim, ydim):
> 
>      Make a list of lists ("array") with the specified dimensions
> 
> 
>    def get_PD_to_mrca(node, mrca, PD):
> 
>      Add up the phylogenetic distance from a node to the specified 
> ancestor (mrca).  Find mrca with find_1st_match.
> 
> 
> 
>    def get_ancestors_list(node, anc_list):
> 
>      Get the list of ancestors of a given node
> 
> 
> 
> 
>    def addup_PD(node, PD):
> 
>      Adds the branchlength of the current node to the total PD measure.
> 
> 
>    def print_tree_outline_format(phylo_obj):
> 
>      Prints the tree out in "outline" format (daughter clades are 
> indented, etc.)
> 
> 
>    def print_Node(node, rank):
> 
>      Prints the node in question, and recursively all daughter nodes, 
> maintaining rank as it goes.
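
To make the recursive pattern behind treelength concrete, here is a
small sketch. The Node class is hypothetical (it is not the lagrange,
Nexus.Tree, or Bio.Tree node), and "above" is taken to mean all
branches between the node and its tips:

```python
class Node(object):
    """Hypothetical minimal tree node, for illustration only."""

    def __init__(self, name, branch_length=0.0, daughters=None):
        self.name = name
        self.branch_length = branch_length  # length of branch leading to node
        self.daughters = daughters or []


def treelength(node):
    """Sum the branch lengths of all descendants of node, recursively."""
    total = 0.0
    for daughter in node.daughters:
        total += daughter.branch_length + treelength(daughter)
    return total


# Equivalent to the Newick string (A:1.0,(B:2.0,C:2.5):0.5);
inner = Node("inner", 0.5, [Node("B", 2.0), Node("C", 2.5)])
root = Node("root", 0.0, [Node("A", 1.0), inner])

total_length = treelength(root)
```

The same walk would work on any node class that exposes a branch length
and a list of daughters, which is one argument for proposing it against
a generic tree interface.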
> 
> 
> 
> class Ranges():
> 
>    Geographic range of a species (collection of points, results
>    of classification of those points into regions), GIS-like functions for
>    processing them.
> 
> 
>    class Points():
> 
>    geographic locations of individual collected specimens
> 
> 
>    def readshpfile(fn):
> 
>    def summarize_shapefile(fn, output_option, outfn):
> 
>    def point_inside_polygon(x,y,poly):
> 
>    def shapefile_points_in_poly(pt_records, poly):
> 
>    def tablefile_points_in_poly(fh, ycol, xcol, namecol, poly):
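
For reference, point_inside_polygon(x, y, poly) is usually done with a
ray-casting test. The outline does not say which method is used, so the
sketch below is an assumption, as is the representation of poly as a
list of (x, y) vertex tuples:

```python
def point_inside_polygon(x, y, poly):
    """Ray casting: count how often a ray going right from (x, y) crosses
    the polygon's edges; an odd count means the point is inside."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]  # wrap around to close the polygon
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * float(x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Made-up test polygon: a 4 x 4 square.
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
center_inside = point_inside_polygon(2, 2, square)
right_outside = point_inside_polygon(5, 2, square)
```

Points exactly on an edge or vertex are a known weak spot of this
method, and treating lat/long as planar coordinates ignores polygons
that cross the date line.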
> 
> ==========
> 
> 
> Here is a summary of the
> 
> Nick Matzke wrote:
> > Thanks for the fix!!!  A big help.  I am currently organizing my 
> > functions into several classes and making sure they work, basically the 
> > classes look like they will be something like:
> > 
> > ==========
> > GbifXml -- for processing GBIF XML results (all of the functions for 
> > searching/extracting stuff from xmltree structures)
> > 
> > TreeSum -- for processing trees & getting summary statistics etc.
> > 
> > Ranges -- Geographic range of a species (collection of points, results 
> > of classification of those points into regions), GIS-like functions for 
> > processing them
> >   Points -- geographic locations of individual collected specimens
> > ==========
> > 
> > 
> > Brad Chapman wrote:
> >> Hi Nick;
> >> Thanks for the comprehensive update. It sounds like your discussion
> >> with Eric resolved most of the questions about the tree
> >> representation. It's great to see y'all converging on this.
> >>
> >>> It sounds like for my immediate purposes, Bio.Nexus.Trees is the 
> >>> solution for now, I will reorganize my code accordingly based on 
> >>> this. If/when Bio.Nexus.Trees accepts node labels I will remove a 
> >>> function stripping out node labels.  Also I have not forgotten 
> >>> previous comments from Brad et al. about bringing the other code up 
> >>> to specs. So I will update the BioGeography schedule and overall 
> >>> organization I hope to have at the end (with classes/methods etc., 
> >>> instead of just a list-o-functions, which is how my original schedule 
> >>> was explicitly laid out), and post an update when done.
> >>
> >> Agreed, and seconding Hilmar that the best thing about open source
> >> code is having others looking at your code. Conversely, feel free to
> >> dig in and fix current code where it is holding you up. To remove
> >> this blocking issue on Nexus and get us rolling again, I
> >> put together an initial fix. You can grab the patch from:
> >>
> >> http://bugzilla.open-bio.org/show_bug.cgi?id=2788
> >>
> >> Let us know if this works for your files of interest.
> >>
> >> If this clears up the Nexus issue, it would be great to see the
> >> revised schedule incorporating the refactoring. Sounds like we are 
> >> moving in the right direction. Good stuff.
> >>
> >> Thanks,
> >> Brad
> >>
> > 
> 
> -- 
> ====================================================
> Nicholas J. Matzke
> Ph.D. Candidate, Graduate Student Researcher
> Huelsenbeck Lab
> Center for Theoretical Evolutionary Genomics
> 4151 VLSB (Valley Life Sciences Building)
> Department of Integrative Biology
> University of California, Berkeley
> ====================================================