[Biopython-dev] BioGeography update/BioPython tree module discussion
Nick Matzke
matzke at berkeley.edu
Mon Jul 20 19:13:59 UTC 2009
Hi all, here is my weekly update...
1. Bug fix on Nexus.Tree class is working well so far. Thanks Brad!!
2. Code refactoring: this is basically the layout I've got going at the
moment. (long outline & function descriptions below)
3. GbifXml is working. My next task is the TreeSum class, which requires
re-doing the functions that made use of the lagrange tree class. I've
built these functions under several different tree classes since January
and have gotten pretty good at tree logic, so this shouldn't be too hard.
4. Philosophy question: If I build some functions that do something new
with an e.g. ElementTree (XML tree) object, should I:
(a) make these functions go in a subclass of the class for the original
object (thus inheriting the methods of the original class, and basically
adding new methods). E.g. basically extending the methods of
ElementTree, with a subclass GbifElementTree; or:
(b) make a class containing the object as an attribute, with e.g.
GbifXml.xmltree containing an ElementTree attribute which then gets
passed to the various functions.
I currently have (b), but the more I think about it, the more (a) makes
sense from a simplicity/usability/maintainability standpoint.
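
To make the two options concrete, here is a minimal sketch of (a) and (b) side by side, using the standard library's xml.etree.ElementTree. The method name count_occurrences and the sample XML are hypothetical, invented only for illustration; they are not part of the actual GbifXml code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; not the real GBIF response format.
XML = "<records><TaxonOccurrence gbifKey='1'/><TaxonOccurrence gbifKey='2'/></records>"


# Option (a): subclass ElementTree, inheriting its methods and adding new ones.
class GbifElementTree(ET.ElementTree):
    def count_occurrences(self):
        # self.iter() is inherited from ElementTree
        return sum(1 for _ in self.iter("TaxonOccurrence"))


# Option (b): hold an ElementTree as an attribute and pass it to the functions.
class GbifXml:
    def __init__(self, xmltree=None):
        self.xmltree = xmltree

    def count_occurrences(self):
        return sum(1 for _ in self.xmltree.iter("TaxonOccurrence"))


root = ET.fromstring(XML)
tree_a = GbifElementTree(root)           # is-a ElementTree
tree_b = GbifXml(ET.ElementTree(root))   # has-a ElementTree
```

With (a), callers can use tree_a.getroot(), tree_a.iter(), etc. directly; with (b), they have to reach through tree_b.xmltree for anything the wrapper does not re-expose.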
Cheers!
Nick
==========
Class for accessing GBIF, downloading records, processing them, and
extracting information from the xmltree in that class.
class GbifXmlError(Exception): pass
class GbifXml():
gbifxml is a class for holding and processing xmltrees of GBIF records.
def __init__(self, xmltree=None):
Constructor: sets up a new object of this class, optionally holding an
existing xmltree.
def print_xmltree(self):
Prints all the elements & subelements of the xmltree to screen (may
require fix_ASCII to input file to succeed)
def print_subelements(self, element):
Takes an element from an XML tree and prints the subelements tag &
text, and the within-tag items (key/value or whatnot)
def element_items_to_dictionary(self, element_items):
If the XML tree element has items encoded in the tag, e.g. key/value or
whatever, this function puts them in a python dictionary and returns
them.
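
A minimal sketch of this idea, assuming element_items is what ElementTree's Element.items() returns (a sequence of (key, value) pairs). It is written as a plain function rather than a method, and the sample element is hypothetical.

```python
import xml.etree.ElementTree as ET


def element_items_to_dictionary(element_items):
    """Turn the (key, value) pairs from Element.items() into a dict."""
    return dict(element_items)


# Hypothetical element with two within-tag items (attributes).
element = ET.fromstring('<TaxonOccurrence gbifKey="123" status="ok"/>')
attrs = element_items_to_dictionary(element.items())
```

Note that ElementTree also exposes the same information directly as the dictionary element.attrib.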
def extract_latlongs(self, element):
Create a temporary pseudofile, extract lat longs to it,
return results as string.
Inspired by: http://www.skymind.com/~ocrow/python_string/
(Method 5: Write to a pseudo file)
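
A minimal sketch of the pseudofile approach, using io.StringIO as the in-memory file. The tag names (occurrence, lat, lon) are assumptions made for this example and are not the real GBIF schema; the function is shown standalone rather than as a method.

```python
import io
import xml.etree.ElementTree as ET


def extract_latlongs(element):
    """Write tab-separated lat/long pairs to a pseudofile, return it as a string."""
    buf = io.StringIO()  # the in-memory "pseudofile"
    for occ in element.iter("occurrence"):  # hypothetical tag name
        lat = occ.findtext("lat")           # hypothetical tag name
        lon = occ.findtext("lon")           # hypothetical tag name
        if lat is not None and lon is not None:
            buf.write(f"{lat}\t{lon}\n")
    return buf.getvalue()


sample = ET.fromstring(
    "<records>"
    "<occurrence><lat>1.5</lat><lon>2.5</lon></occurrence>"
    "<occurrence><lat>3.0</lat><lon>4.0</lon></occurrence>"
    "<occurrence><lat>9.9</lat></occurrence>"  # no longitude: skipped
    "</records>"
)
latlong_text = extract_latlongs(sample)
```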
def extract_latlong_datum(self, element, file_str):
Searches an element in an XML tree for lat/long information, and the
complete name. Searches recursively, if there are subelements.
def extract_taxonconceptkeys_tofile(self, element, outfh):
Searches an element in an XML tree for TaxonOccurrence gbifKeys,
and the complete name. Searches recursively, if there are subelements.
Returns file at outfh.
def extract_taxonconceptkeys_tolist(self, element, output_list):
Searches an element in an XML tree for TaxonOccurrence gbifKeys,
and the complete name. Searches recursively, if there are subelements.
Returns list.
def extract_occurrence_elements(self, element, output_list):
Returns a list of the elements, picking elements by
TaxonOccurrence; this should
return a list of elements equal to the number of hits.
def find_to_elements_w_ancs(self, el_tag, anc_el_tag):
Burrow into XML to get an element with tag el_tag, return only
those el_tags underneath a particular ancestor element anc_el_tag
def create_sub_xmltree(self, element):
Create a subset xmltree (to avoid going back to irrelevant parents)
def xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag,
match_el_list):
Recursively burrows down to find whatever elements with el_tag
exist inside an anc_el_tag.
def xml_burrow_up(self, element, anc_el_tag, found_anc):
Burrow up xml to find anc_el_tag
def xml_burrow_up_cousin(element, cousin_el_tag, found_cousin):
Burrow up from element of interest, until a cousin is found with
cousin_el_tag
def return_parent_in_xmltree(self, child_to_search_for):
Search through an xmltree to get the parent of child_to_search_for
def return_parent_in_element(self, potential_parent,
child_to_search_for, returned_parent):
Search through an XML element to return parent of child_to_search_for
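
Standard-library ElementTree elements carry no pointer to their parent, which is why a search like this is needed at all. A minimal sketch of one way to do it, assuming xmltree is an ElementTree object; it is written as a standalone function with the tree passed in explicitly, which differs from the method signature above.

```python
import xml.etree.ElementTree as ET


def return_parent_in_xmltree(xmltree, child_to_search_for):
    """Return the parent of child_to_search_for, or None if it has none."""
    for potential_parent in xmltree.iter():   # every element in the tree
        for child in potential_parent:        # its direct children only
            if child is child_to_search_for:  # identity, not tag equality
                return potential_parent
    return None


root = ET.fromstring("<a><b><c/></b><d/></a>")
tree = ET.ElementTree(root)
c = root.find("b/c")
parent_of_c = return_parent_in_xmltree(tree, c)
```

If many lookups are needed on the same tree, building a child-to-parent dictionary once would avoid repeating the full scan each time.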
def find_1st_matching_element(self, element, el_tag, return_element):
Burrow down into the XML tree, retrieve the first element with the
matching tag
# Functions devoted to accessing/downloading GBIF records
def access_gbif(url, params):
# Helper function to access various GBIF services
#
# choose the URL ("url") from here:
# http://data.gbif.org/ws/rest/occurrence
#
# params are a dictionary of key/value pairs
#
# "_open" is from Bio.Entrez._open, online here:
# http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#_open
#
# Get the handle of results
# (looks like e.g.: <addinfourl at 75575128 whose fp =
# <socket._fileobject object at 0x48117f0>> )
# (open with results_handle.read() )
def get_hits(params):
Get the actual hits that are returned by a given search
(this allows parsing & gradual downloading of searches larger
than e.g. 1000 records)
It will return the LAST non-none instance (in a standard search
result there should be only one, anyway).
def get_xml_hits(params):
Returns hits like get_hits, but returns a parsed XML tree.
def get_all_records_by_increment(params, inc, prefix_fn):
Download all of the records in stages, store in list of elements.
Increments of e.g. 100 to not overload server
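
A minimal sketch of the incremental download loop. To keep it runnable without network access, the actual request is injected as a fetch_page callable, so the signature here differs from the one above. The paging parameter names startindex and maxresults, the numhits argument, and the fake fetcher are all assumptions made for illustration.

```python
def get_all_records_by_increment(params, inc, numhits, fetch_page):
    """Collect all records by requesting them inc at a time."""
    records = []
    for start in range(0, numhits, inc):
        # Copy the search params and add the (assumed) paging keys.
        page_params = dict(params, startindex=start, maxresults=inc)
        records.extend(fetch_page(page_params))
    return records


# Stand-in for the server: 250 fake "records".
FAKE_RECORDS = list(range(250))
requests_made = []


def fake_fetch_page(page_params):
    requests_made.append(page_params["startindex"])
    start = page_params["startindex"]
    return FAKE_RECORDS[start:start + page_params["maxresults"]]


all_records = get_all_records_by_increment(
    {"scientificname": "Puma concolor"}, 100, len(FAKE_RECORDS), fake_fetch_page
)
```

In real use, fetch_page would wrap the GBIF request plus XML parsing, and numhits would come from something like get_numhits(params).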
def get_record(key):
Get a single record, return xmltree for it.
def get_numhits(params):
Get the number of hits that will be returned by a given search
(this allows parsing & gradual downloading of searches larger
than e.g. 1000 records)
It will return the LAST non-none instance (in a standard search
result there should be only one, anyway).
def extract_numhits(element):
# Search an element of a parsed XML string and find the
# number of hits, if it exists. Recursively searches,
# if there are subelements.
#
def xmlstring_to_xmltree(xmlstring):
Take the text string returned by GBIF and parse to an XML tree using
ElementTree.
Requires the intermediate step of saving to a temporary file
(required to make ElementTree.parse work, apparently)
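
For comparison, a minimal sketch that avoids the temporary file: the standard library's xml.etree.ElementTree.fromstring parses a string in memory and returns the root Element, which can then be wrapped in an ElementTree. This is an alternative to the temporary-file step described above, not the code as currently written, and the sample string is hypothetical.

```python
import xml.etree.ElementTree as ET


def xmlstring_to_xmltree(xmlstring):
    """Parse an XML string in memory and return an ElementTree."""
    root = ET.fromstring(xmlstring)  # returns the root Element
    return ET.ElementTree(root)      # wrap it so callers get a tree object


# Hypothetical response text; not the real GBIF format.
xmltree = xmlstring_to_xmltree("<response><hits>2</hits></response>")
```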
class TreeSum():
Summary statistics on trees (some of these are now redundant with
Nexus.Tree & will be eliminated).
def read_ultrametric_Newick(newickstr):
Read a Newick file into a tree object (a series of node objects
with links to parent and daughter nodes), also reading node ages and
node labels if any.
def list_leaves(phylo_obj):
Print out all of the leaves above a node object
def treelength(node):
Gets the total branchlength above a given node by recursively
adding through tree.
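
A minimal sketch of the recursion, assuming a hypothetical Node class in which each node stores the length of the branch leading to it and a list of daughter nodes. This is not the lagrange or Nexus.Trees node class.

```python
class Node:
    """Hypothetical tree node, for illustration only."""

    def __init__(self, name, branchlength=0.0, daughters=None):
        self.name = name
        self.branchlength = branchlength  # length of the branch below this node
        self.daughters = daughters or []


def treelength(node):
    """Total branch length above node (node's own branch is not counted)."""
    return sum(d.branchlength + treelength(d) for d in node.daughters)


# ((A:1,B:2):0.5,C:3);
inner = Node("AB", 0.5, [Node("A", 1.0), Node("B", 2.0)])
root = Node("root", 0.0, [inner, Node("C", 3.0)])
total = treelength(root)
```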
def phylodistance(node1, node2):
Get the phylogenetic distance (branch length) between two nodes.
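
A minimal sketch of one way to compute this: walk up from each node to their most recent common ancestor (mrca), adding branch lengths along the way. It assumes a hypothetical Node class with a parent link and that both nodes belong to the same tree; it is not the lagrange or Nexus.Trees node class.

```python
class Node:
    """Hypothetical tree node with a parent link, for illustration only."""

    def __init__(self, name, branchlength=0.0, parent=None):
        self.name = name
        self.branchlength = branchlength  # length of the branch to the parent
        self.parent = parent


def phylodistance(node1, node2):
    """Sum of branch lengths on the path between two nodes of one tree."""
    # Record node1 and all of its ancestors.
    lineage1 = set()
    n = node1
    while n is not None:
        lineage1.add(id(n))
        n = n.parent
    # Climb from node2 until hitting that lineage: that node is the mrca.
    dist = 0.0
    n = node2
    while id(n) not in lineage1:
        dist += n.branchlength
        n = n.parent
    mrca = n
    # Climb from node1 to the mrca.
    n = node1
    while n is not mrca:
        dist += n.branchlength
        n = n.parent
    return dist


# ((A:1,B:2):0.5,C:3);
root = Node("root")
inner = Node("AB", 0.5, root)
a = Node("A", 1.0, inner)
b = Node("B", 2.0, inner)
c = Node("C", 3.0, root)
```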
def get_distance_matrix(phylo_obj):
Get a matrix of all of the pairwise distances between the tips of a
tree.
def get_mrca_array(phylo_obj):
Get a square list of lists (array) listing the mrca of each pair of
leaves
(half-diagonal matrix)
def subset_tree(phylo_obj, list_to_keep):
Given a list of tips and a tree, remove all other tips and
resulting redundant nodes to produce a new smaller tree.
def prune_single_desc_nodes(node):
Follow a tree from the bottom up, pruning any nodes with only one
descendant
def find_new_root(node):
Search up tree from root and make new root at first divergence
def make_None_list_array(xdim, ydim):
Make a list of lists ("array") with the specified dimensions
def get_PD_to_mrca(node, mrca, PD):
Add up the phylogenetic distance from a node to the specified
ancestor (mrca). Find mrca with find_1st_match.
def get_ancestors_list(node, anc_list):
Get the list of ancestors of a given node
def addup_PD(node, PD):
Adds the branchlength of the current node to the total PD measure.
def print_tree_outline_format(phylo_obj):
Prints the tree out in "outline" format (daughter clades are
indented, etc.)
def print_Node(node, rank):
Prints the node in question, and recursively all daughter nodes,
maintaining rank as it goes.
class Ranges():
Geographic range of a species (collection of points, results
of classification of those points into regions), GIS-like functions for
processing them.
class Points():
geographic locations of individual collected specimens
def readshpfile(fn):
def summarize_shapefile(fn, output_option, outfn):
def point_inside_polygon(x,y,poly):
def shapefile_points_in_poly(pt_records, poly):
def tablefile_points_in_poly(fh, ycol, xcol, namecol, poly):
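
Since point_inside_polygon has no description above, here is a minimal sketch of the standard ray-casting (even-odd) test it could use. It assumes poly is a list of (x, y) vertex tuples in order; whether the actual function works this way is not stated in the outline.

```python
def point_inside_polygon(x, y, poly):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    j = len(poly) - 1  # index of the previous vertex
    for i in range(len(poly)):
        xi, yi = poly[i]
        xj, yj = poly[j]
        # Does this edge straddle the horizontal line through y?
        if (yi > y) != (yj > y):
            # x coordinate where the edge crosses that line
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside  # each crossing to the right flips the state
        j = i
    return inside


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
```

Points lying exactly on an edge or vertex are a boundary case and may be classified either way by this kind of test.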
==========
Here is a summary of the
Nick Matzke wrote:
> Thanks for the fix!!! A big help. I am currently organizing my
> functions into several classes and making sure they work, basically the
> classes look like they will be something like:
>
> ==========
> GbifXml -- for processing GBIF XML results (all of the functions for
> searching/extracting stuff from xmltree structures)
>
> TreeSum -- for processing trees & getting summary statistics etc.
>
> Ranges -- Geographic range of a species (collection of points, results
> of classification of those points into regions), GIS-like functions for
> processing them
> Points -- geographic locations of individual collected specimens
> ==========
>
>
> Brad Chapman wrote:
>> Hi Nick;
>> Thanks for the comprehensive update. It sounds like your discussion
>> with Eric resolved most of the questions about the tree
>> representation. It's great to see y'all converging on this.
>>
>>> It sounds like for my immediate purposes, Bio.Nexus.Trees is the
>>> solution for now, I will reorganize my code accordingly based on
>>> this. If/when Bio.Nexus.Trees accepts node labels I will remove a
>>> function stripping out node labels. Also I have not forgotten
>>> previous comments from Brad et al. about bringing the other code up
>>> to specs. So I will update the BioGeography schedule and overall
>>> organization I hope to have at the end (with classes/methods etc.,
>>> instead of just a list-o-functions, which is how my original schedule
>>> was explicitly laid out), and post an update when done.
>>
>> Agreed, and seconding Hilmar that the best thing about open source
>> code is having others looking at your code. Conversely, feel free to
>> dig in and fix current code where it is holding you up. To remove
>> this blocking issue on Nexus and get us rolling again, I
>> put together an initial fix. You can grab the patch from:
>>
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2788
>>
>> Let us know if this works for your files of interest.
>>
>> If this clears up the Nexus issue, it would be great to see the
>> revised schedule incorporating the refactoring. Sounds like we are
>> moving in the right direction. Good stuff.
>>
>> Thanks,
>> Brad
>>
>
--
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley
Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page:
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu
Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140
-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people
thought the earth was spherical, they were wrong. But if you think that
thinking the earth is spherical is just as wrong as thinking the earth
is flat, then your view is wronger than both of them put together."
Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================