[Biopython-dev] BioGeography update/BioPython tree module discussion

Nick Matzke matzke at berkeley.edu
Mon Jul 20 19:13:59 UTC 2009

Hi all, here is my weekly update...

1. Bug fix on Nexus.Tree class is working well so far.  Thanks Brad!!

2. Code refactoring: this is basically the layout I've got going at the 
moment.  (long outline & function descriptions below)

3. GbifXml is working; my next task is the TreeSum class, which requires 
re-doing the functions which made use of the lagrange tree class.  I've 
built these functions under several different tree classes since January 
and have gotten pretty good at tree logic, so this shouldn't be too hard.

4. Philosophy question: If I build some functions that do something new 
with an e.g. ElementTree (XML tree) object, should I:

(a) make these functions go in a subclass of the class for the original 
object (thus inheriting the methods of the original class, and basically 
adding new methods).  E.g. basically extending the methods of 
ElementTree, with a subclass GbifElementTree; or:

(b) make a class containing the object as an attribute, with e.g. 
GbifXml.xmltree containing an ElementTree attribute which then gets 
passed to the various functions.

I currently have (b), but the more I think about it, the more (a) makes 
sense from a simplicity/usability/maintainability standpoint.
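
A minimal sketch of the two options, using the standard-library xml.etree.ElementTree and written for current Python 3. The class name GbifElementTree comes from option (a) above; the method count_tag and the sample XML are hypothetical, made up for illustration, and are not part of the BioGeography code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; not real GBIF output.
SAMPLE = "<records><rec><name>a</name></rec><rec><name>b</name></rec></records>"


class GbifElementTree(ET.ElementTree):
    """Option (a): subclass ElementTree and add new methods."""

    def count_tag(self, tag):
        # self.iter() is inherited from ElementTree.
        return sum(1 for _ in self.iter(tag))


class GbifXml:
    """Option (b): hold an ElementTree as an attribute."""

    def __init__(self, xmltree=None):
        self.xmltree = xmltree

    def count_tag(self, tag):
        # Every call has to go through the wrapped attribute.
        return sum(1 for _ in self.xmltree.iter(tag))


root = ET.fromstring(SAMPLE)
tree_a = GbifElementTree(root)
tree_b = GbifXml(ET.ElementTree(root))
print(tree_a.count_tag("rec"), tree_b.count_tag("rec"))
```

With (a), inherited methods such as find() and getroot() are available directly on the object; with (b), callers reach them through the xmltree attribute. One trade-off with (a) is that ET.parse() returns a plain ElementTree, so a subclass instance has to be constructed explicitly.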


Class for accessing GBIF, downloading records, processing them, and 
extracting information from the xmltree in that class.

class GbifXmlError(Exception): pass
class GbifXml():
   gbifxml is a class for holding and processing xmltrees of GBIF records.

   def __init__(self, xmltree=None):

     This is the constructor for setting up new objects of this class.

   def print_xmltree(self):

     Prints all the elements & subelements of the xmltree to screen (may
     require running fix_ASCII on the input file to succeed)

   def print_subelements(self, element):

     Takes an element from an XML tree and prints the subelements' tag &
     text, and the within-tag items (key/value or whatnot)

   def element_items_to_dictionary(self, element_items):

     If the XML tree element has items encoded in the tag, e.g. key/value or
     whatever, this function puts them in a python dictionary and returns it.

   def extract_latlongs(self, element):

     Create a temporary pseudofile, extract lat longs to it,
     return results as string.

     Inspired by: http://www.skymind.com/~ocrow/python_string/
     (Method 5: Write to a pseudo file)

   def extract_latlong_datum(self, element, file_str):

     Searches an element in an XML tree for lat/long information, and the
     complete name. Searches recursively, if there are subelements.

   def extract_taxonconceptkeys_tofile(self, element, outfh):

     Searches an element in an XML tree for TaxonOccurrence gbifKeys, and the
     complete name. Searches recursively, if there are subelements.
     Returns file at outfh.

   def extract_taxonconceptkeys_tolist(self, element, output_list):

     Searches an element in an XML tree for TaxonOccurrence gbifKeys, and the
     complete name. Searches recursively, if there are subelements.
     Returns list.

   def extract_occurrence_elements(self, element, output_list):

     Returns a list of the elements, picking elements by TaxonOccurrence;
     this should return a list of elements equal to the number of hits.

   def find_to_elements_w_ancs(self, el_tag, anc_el_tag):

     Burrow into the XML to get elements with tag el_tag, returning only
     those underneath a particular ancestor element with tag anc_el_tag

   def create_sub_xmltree(self, element):

     Create a subset xmltree (to avoid going back to irrelevant parents)

   def xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag, 

     Recursively burrows down to find whatever elements with el_tag
     exist inside an ancestor element with tag anc_el_tag.

   def xml_burrow_up(self, element, anc_el_tag, found_anc):

     Burrow up xml to find anc_el_tag

   def xml_burrow_up_cousin(element, cousin_el_tag, found_cousin):

     Burrow up from element of interest, until a cousin is found with
     tag cousin_el_tag

   def return_parent_in_xmltree(self, child_to_search_for):

     Search through an xmltree to get the parent of child_to_search_for

   def return_parent_in_element(self, potential_parent, child_to_search_for, returned_parent):

     Search through an XML element to return parent of child_to_search_for

   def find_1st_matching_element(self, element, el_tag, return_element):

     Burrow down into the XML tree, retrieve the first element with the
     matching tag
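
ElementTree elements do not keep a reference to their parent, which is presumably why the outline above needs helpers such as return_parent_in_xmltree and xml_burrow_up. A minimal sketch of one common way to get the same effect is to build a child-to-parent map once and walk it upward. The function names build_parent_map, burrow_up and find_elements_w_anc, and the sample tags, are hypothetical; this is not the GbifXml implementation.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """Map each element to its parent; the root element has no entry."""
    return {child: parent for parent in root.iter() for child in parent}


def burrow_up(element, anc_el_tag, parent_map):
    """Return the nearest ancestor of element with tag anc_el_tag, or None."""
    current = parent_map.get(element)
    while current is not None:
        if current.tag == anc_el_tag:
            return current
        current = parent_map.get(current)
    return None


def find_elements_w_anc(root, el_tag, anc_el_tag):
    """Return elements tagged el_tag that sit below an anc_el_tag element."""
    parent_map = build_parent_map(root)
    return [
        el
        for el in root.iter(el_tag)
        if burrow_up(el, anc_el_tag, parent_map) is not None
    ]


# Hypothetical sample document; the tags are made up for illustration.
sample_root = ET.fromstring(
    "<root><occ><id><name>a</name></id></occ><other><name>b</name></other></root>"
)
matches = find_elements_w_anc(sample_root, "name", "occ")
print([el.text for el in matches])
```

Building the map costs one pass over the tree, after which each parent lookup is a dictionary access instead of a fresh recursive search.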

# Functions devoted to accessing/downloading GBIF records

def access_gbif(url, params):

   # Helper function to access various GBIF services
   # choose the URL ("url") from here:
   # http://data.gbif.org/ws/rest/occurrence
   # params are a dictionary of key/value pairs
   # "_open" is from Bio.Entrez._open, online here:
   # http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#_open
   # Get the handle of results
   # (looks like e.g.: <addinfourl at 75575128 whose fp =
   # <socket._fileobject object at 0x48117f0>> )

   # (open with results_handle.read() )

def get_hits(params):

   Get the actual hits that are returned by a given search
   (this allows parsing & gradual downloading of searches larger
   than e.g. 1000 records)
   It will return the LAST non-none instance (in a standard search result
   there should be only one, anyway).

def get_xml_hits(params):

   Returns hits like get_hits, but returns a parsed XML tree.

def get_all_records_by_increment(params, inc, prefix_fn):

   Download all of the records in stages, store in list of elements.
   Increments of e.g. 100 to not overload server

def get_record(key):

   Get a single record, return xmltree for it.

def get_numhits(params):

   Get the number of hits that will be returned by a given search
   (this allows parsing & gradual downloading of searches larger
   than e.g. 1000 records)
   It will return the LAST non-none instance (in a standard search result
   there should be only one, anyway).

def extract_numhits(element):

   # Search an element of a parsed XML string and find the
   # number of hits, if it exists.  Recursively searches,
   # if there are subelements.

def xmlstring_to_xmltree(xmlstring):

   Take the text string returned by GBIF and parse to an XML tree using
   ElementTree.
   Requires the intermediate step of saving to a temporary file (required
   to make ElementTree.parse work, apparently)
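
A hedged sketch of two of the pieces above, written for current Python 3. ElementTree.fromstring parses directly from a string, so the temporary file can probably be avoided. The helper name increments is hypothetical, and the zero-based start offset is an assumption; the actual GBIF paging parameters are not shown here.

```python
import xml.etree.ElementTree as ET


def xmlstring_to_xmltree(xmlstring):
    """Parse an XML string into an ElementTree without a temporary file."""
    # fromstring() returns the root Element; wrap it to get a full tree object.
    return ET.ElementTree(ET.fromstring(xmlstring))


def increments(numhits, inc):
    """Yield (start, count) pairs covering numhits records in steps of inc."""
    for start in range(0, numhits, inc):
        # The last chunk may be shorter than inc.
        yield start, min(inc, numhits - start)


tree = xmlstring_to_xmltree("<results><hit/><hit/></results>")
print(tree.getroot().tag)
print(list(increments(250, 100)))
```

Each (start, count) pair would correspond to one request in get_all_records_by_increment, so no single request asks the server for more than inc records.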

class TreeSum():

   Summary statistics on trees (some of these are now redundant with
   Nexus.Tree & will be eliminated).

   def read_ultrametric_Newick(newickstr):

     Read a Newick file into a tree object (a series of node objects
     with links to parent and daughter nodes), also reading node ages and
     node labels if any.

   def list_leaves(phylo_obj):

     Print out all of the leaves above a node object

   def treelength(node):

     Gets the total branchlength above a given node by recursively
     adding through tree.

   def phylodistance(node1, node2):

     Get the phylogenetic distance (branch length) between two nodes.

   def get_distance_matrix(phylo_obj):

     Get a matrix of all of the pairwise distances between the tips of a
     tree.

   def get_mrca_array(phylo_obj):

     Get a square list of lists (array) listing the mrca of each pair of
     tips (half-diagonal matrix)

   def subset_tree(phylo_obj, list_to_keep):

     Given a list of tips and a tree, remove all other tips and
     resulting redundant nodes to produce a new smaller tree.

   def prune_single_desc_nodes(node):

     Follow a tree from the bottom up, pruning any nodes with only one
     descendant.

   def find_new_root(node):

     Search up tree from root and make new root at first divergence

   def make_None_list_array(xdim, ydim):

     Make a list of lists ("array") with the specified dimensions

   def get_PD_to_mrca(node, mrca, PD):

     Add up the phylogenetic distance from a node to the specified
     ancestor (mrca).  Find mrca with find_1st_match.

   def get_ancestors_list(node, anc_list):

     Get the list of ancestors of a given node

   def addup_PD(node, PD):

     Adds the branchlength of the current node to the total PD measure.

   def print_tree_outline_format(phylo_obj):

     Prints the tree out in "outline" format (daughter clades are
     indented, etc.)

   def print_Node(node, rank):

     Prints the node in question, and recursively all daughter nodes,
     maintaining rank as it goes.
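
To illustrate the tree logic behind treelength and phylodistance, here is a minimal sketch on a hypothetical Node class (label, branch length, parent, daughters). The Node class, the helper find_mrca, and the simplified signatures are assumptions for illustration; they are not the lagrange or Nexus.Tree classes, and the outline's functions take extra accumulator arguments. "Above" follows the convention used in the outline, meaning toward the tips.

```python
class Node:
    """Hypothetical tree node; brlen is the branch leading to this node."""

    def __init__(self, label, brlen=0.0, parent=None):
        self.label = label
        self.brlen = brlen
        self.parent = parent
        self.daughters = []
        if parent is not None:
            parent.daughters.append(self)


def treelength(node):
    """Total branch length above node, summed recursively over daughters."""
    return sum(d.brlen + treelength(d) for d in node.daughters)


def get_ancestors_list(node):
    """Ancestors of node, ordered from its parent down to the root."""
    ancestors = []
    while node.parent is not None:
        node = node.parent
        ancestors.append(node)
    return ancestors


def find_mrca(node1, node2):
    """Most recent common ancestor of two nodes in the same tree."""
    lineage1 = [node1] + get_ancestors_list(node1)
    for candidate in [node2] + get_ancestors_list(node2):
        if any(candidate is n for n in lineage1):
            return candidate
    return None


def get_PD_to_mrca(node, mrca):
    """Branch length from node back to the given ancestor."""
    pd = 0.0
    while node is not mrca:
        pd += node.brlen
        node = node.parent
    return pd


def phylodistance(node1, node2):
    """Path length between two nodes via their mrca."""
    mrca = find_mrca(node1, node2)
    return get_PD_to_mrca(node1, mrca) + get_PD_to_mrca(node2, mrca)


# Small example tree: (A:1.0,(B:2.0,C:3.0):0.5);
root = Node("root")
tip_a = Node("A", 1.0, root)
inner = Node("inner", 0.5, root)
tip_b = Node("B", 2.0, inner)
tip_c = Node("C", 3.0, inner)
print(treelength(root), phylodistance(tip_a, tip_b))
```

A pairwise distance matrix like the one get_distance_matrix describes could then be filled by calling phylodistance on every pair of tips, although computing the mrca array once and reusing it would avoid repeated ancestor walks.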

class Ranges():

   Geographic range of a species (collection of points, results
   of classification of those points into regions), GIS-like functions for
   processing them.

   class Points():

     Geographic locations of individual collected specimens

   def readshpfile(fn):

   def summarize_shapefile(fn, output_option, outfn):

   def point_inside_polygon(x,y,poly):

   def shapefile_points_in_poly(pt_records, poly):

   def tablefile_points_in_poly(fh, ycol, xcol, namecol, poly):
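
The outline gives no description for point_inside_polygon, so the following is only a sketch of the standard ray-casting (even-odd) test that a function with this signature commonly uses. It assumes poly is a list of (x, y) vertex tuples and treats coordinates as planar, which is an approximation for lat/long data; points lying exactly on an edge are not handled consistently.

```python
def point_inside_polygon(x, y, poly):
    """Even-odd test: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(poly)
    j = n - 1  # index of the previous vertex; starts at the last one
    for i in range(n):
        xi, yi = poly[i]
        xj, yj = poly[j]
        # Does the edge (j -> i) straddle the horizontal line through y?
        if (yi > y) != (yj > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside


# Hypothetical square region with corners at (0, 0) and (10, 10)
square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
print(point_inside_polygon(5.0, 5.0, square))
print(point_inside_polygon(15.0, 5.0, square))
```

Functions such as shapefile_points_in_poly could then loop over the point records and keep those for which this test returns True.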


Here is a summary of the

Nick Matzke wrote:
> Thanks for the fix!!!  A big help.  I am currently organizing my 
> functions into several classes and making sure they work, basically the 
> classes look like they will be something like:
> ==========
> GbifXml -- for processing GBIF XML results (all of the functions for 
> searching/extracting stuff from xmltree structures)
> TreeSum -- for processing trees & getting summary statistics etc.
> Ranges -- Geographic range of a species (collection of points, results 
> of classification of those points into regions), GIS-like functions for 
> processing them
>   Points -- geographic locations of individual collected specimens
> ==========
> Brad Chapman wrote:
>> Hi Nick;
>> Thanks for the comprehensive update. It sounds like your discussion
>> with Eric resolved most of the questions about the tree
>> representation. It's great to see y'all converging on this.
>>> It sounds like for my immediate purposes, Bio.Nexus.Trees is the 
>>> solution for now, I will reorganize my code accordingly based on 
>>> this. If/when Bio.Nexus.Trees accepts node labels I will remove a 
>>> function stripping out node labels.  Also I have not forgotten 
>>> previous comments from Brad et al. about bringing the other code up 
>>> to specs. So I will update the BioGeography schedule and overall 
>>> organization I hope to have at the end (with classes/methods etc., 
>>> instead of just a list-o-functions, which is how my original schedule 
>>> was explicitly laid out), and post an update when done.
>> Agreed, and seconding Hilmar that the best thing about open source
>> code is having others looking at your code. Conversely, feel free to
>> dig in and fix current code where it is holding you up. To remove
>> this blocking issue on Nexus and get us rolling again, I
>> put together an initial fix. You can grab the patch from:
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2788
>> Let us know if this works for your files of interest.
>> If this clears up the Nexus issue, it would be great to see the
>> revised schedule incorporating the refactoring. Sounds like we are 
>> moving in the right direction. Good stuff.
>> Thanks,
>> Brad

Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley

Lab websites:
Dept. personal page: 
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu

Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140

"[W]hen people thought the earth was flat, they were wrong. When people 
thought the earth was spherical, they were wrong. But if you think that 
thinking the earth is spherical is just as wrong as thinking the earth 
is flat, then your view is wronger than both of them put together."

Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 
14(1), 35-44. Fall 1989.
