[Biopython-dev] BioGeography update/BioPython tree module discussion
Brad Chapman
chapmanb at 50mail.com
Tue Jul 21 12:22:13 UTC 2009
Hi Nick;
> 1. Bug fix on Nexus.Tree class is working well so far. Thanks Brad!!
Sweet. Glad to hear it.
> 2. Code refactoring: this is basically the layout I've got going at the
> moment. (long outline & function descriptions below)
Is this checked in on GitHub? I pulled from the Geography
branch but didn't get the new code. The organization below looks
great and really helps with clarity. One additional suggestion I
would make is to prefix functions which are not part of the public API
with an underscore (_internal_function). Just from the descriptions,
I imagine some of the functions like xml_burrow_up_cousin would not be
called directly by users.
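A minimal sketch of that naming convention, assuming a simplified GbifXml with one public method and one internal helper. The method bodies and the helper name _burrow_down are hypothetical and are not taken from Nick's code; the public method also takes fewer arguments than the one in the outline.

```python
import xml.etree.ElementTree as ET


class GbifXml:
    """Toy stand-in for the GbifXml class; bodies are illustrative only."""

    def __init__(self, xmltree=None):
        self.xmltree = xmltree

    def find_1st_matching_element(self, el_tag):
        """Public API: return the first element with tag el_tag, or None."""
        return self._burrow_down(self.xmltree.getroot(), el_tag)

    def _burrow_down(self, element, el_tag):
        """Internal helper (leading underscore): depth-first tag search."""
        if element.tag == el_tag:
            return element
        for child in element:
            found = self._burrow_down(child, el_tag)
            if found is not None:
                return found
        return None


tree = ET.ElementTree(ET.fromstring("<a><b><c>hit</c></b></a>"))
gbif = GbifXml(tree)
print(gbif.find_1st_matching_element("c").text)
```

The underscore is only a convention: Python does not block access to _burrow_down, but it signals to users and to documentation tools that the name may change without notice.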
> 3. GbifXml is working, my next task is the TreeSum class which requires
> re-doing the functions which made use of the lagrange tree class. I've
> built these functions under several different tree classes since January
> and have gotten pretty good at tree logic so this shouldn't be too hard.
Great. Have you had a look at Eric's generic Tree proposal, which he
was working on this week:
http://github.com/etal/biopython/tree/phyloxml/Bio/Tree
It would be great to propose general functionality there so it can
be rolled into PhyloXML and ultimately Nexus parsing as well.
> 4. Philosophy question: If I build some functions that do something new
> with an e.g. ElementTree (XML tree) object, should I:
>
> (a) make these functions go in a subclass of the class for the original
> object (thus inheriting the methods of the original class, and basically
> adding new methods). E.g. basically extending the methods of
> ElementTree, with a subclass GbifElementTree; or:
>
> (b) make a class containing the object as an attribute, with e.g.
> GbifXml.xmltree containing an ElementTree attribute which then gets
> passed to the various functions.
>
> I currently have (b) but the more I think about it, the more (a) seems
> to make sense from a simplicity/usability/maintainability standpoint.
My vote would be for your (b) option. ElementTree is a pretty tricky
interface with overrides for attribute access, so inheriting from it
could be error-prone and more trouble than it's worth. If you find
yourself mirroring ElementTree functionality, you could always make
the tree itself a public attribute and encourage users to call it
directly.
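Option (b) could look roughly like the sketch below, written for current Python. The count_tags method and the sample XML are made up for illustration; only the xmltree attribute name comes from the outline.

```python
import xml.etree.ElementTree as ET


class GbifXml:
    """Holds an ElementTree as a public attribute instead of subclassing it."""

    def __init__(self, xmltree=None):
        # Public on purpose: users may call ElementTree methods on it directly.
        self.xmltree = xmltree

    def count_tags(self, el_tag):
        """Hypothetical added functionality built on the wrapped tree."""
        return sum(1 for _ in self.xmltree.iter(el_tag))


xml_text = "<records><rec id='1'/><rec id='2'/></records>"
gbif = GbifXml(ET.ElementTree(ET.fromstring(xml_text)))

print(gbif.count_tags("rec"))        # method added by the wrapper
print(gbif.xmltree.getroot().tag)    # plain ElementTree call, not mirrored
```

With this layout the wrapper only has to define what is new, and anything ElementTree already does stays reachable through gbif.xmltree.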
Brad
>
> Cheers!
> Nick
>
> ==========
> Class for accessing GBIF, downloading records, processing them, and
> extracting information from the xmltree in that class.
>
> class GbifXmlError(Exception): pass
> class GbifXml():
> gbifxml is a class for holding and processing xmltrees of GBIF records.
>
> def __init__(self, xmltree=None):
>
> Constructor for setting up new objects of this class.
>
> def print_xmltree(self):
>
> Prints all the elements & subelements of the xmltree to screen (may
> require fix_ASCII to input file to succeed)
>
> def print_subelements(self, element):
>
> Takes an element from an XML tree and prints the subelements' tag &
> text, and the within-tag items (key/value or whatnot)
>
>
> def element_items_to_dictionary(self, element_items):
>
> If the XML tree element has items encoded in the tag, e.g. key/value or
> whatever, this function puts them in a python dictionary and returns
> them.
>
>
>
> def extract_latlongs(self, element):
>
> Create a temporary pseudofile, extract lat longs to it,
> return results as string.
>
> Inspired by: http://www.skymind.com/~ocrow/python_string/
> (Method 5: Write to a pseudo file)
>
>
> def extract_latlong_datum(self, element, file_str):
>
> Searches an element in an XML tree for lat/long information, and the
> complete name. Searches recursively, if there are subelements.
>
>
>
> def extract_taxonconceptkeys_tofile(self, element, outfh):
>
> Searches an element in an XML tree for TaxonOccurrence gbifKeys,
> and the complete name. Searches recursively, if there are subelements.
> Returns file at outfh.
>
>
>
>
> def extract_taxonconceptkeys_tolist(self, element, output_list):
>
> Searches an element in an XML tree for TaxonOccurrence gbifKeys,
> and the complete name. Searches recursively, if there are subelements.
> Returns list.
>
>
>
>
>
> def extract_occurrence_elements(self, element, output_list):
>
> Returns a list of the elements, picking elements by TaxonOccurrence;
> this should return a list of elements equal to the number of hits.
>
>
>
>
> def find_to_elements_w_ancs(self, el_tag, anc_el_tag):
>
> Burrow into XML to get elements with tag el_tag, return only
> those underneath a particular ancestor element with tag anc_el_tag
>
>
> def create_sub_xmltree(self, element):
>
> Create a subset xmltree (to avoid going back to irrelevant parents)
>
>
>
> def xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag,
> match_el_list):
>
> Recursively burrows down to find whatever elements with el_tag
> exist inside an ancestor element with tag anc_el_tag.
>
>
> def xml_burrow_up(self, element, anc_el_tag, found_anc):
>
> Burrow up xml to find anc_el_tag
>
>
>
> def xml_burrow_up_cousin(element, cousin_el_tag, found_cousin):
>
> Burrow up from element of interest, until a cousin is found with
> cousin_el_tag
>
>
>
> def return_parent_in_xmltree(self, child_to_search_for):
>
> Search through an xmltree to get the parent of child_to_search_for
>
>
>
> def return_parent_in_element(self, potential_parent,
> child_to_search_for, returned_parent):
>
> Search through an XML element to return parent of child_to_search_for
>
>
>
> def find_1st_matching_element(self, element, el_tag, return_element):
>
> Burrow down into the XML tree, retrieve the first element with the
> matching tag
>
>
>
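The "burrow up" and "return parent" functions above exist because the standard library's ElementTree elements keep no reference to their parent. One common way to handle that, sketched here as an alternative and not as the approach used in the outlined code, is to build a child-to-parent dictionary once and walk it upward. The helper name build_parent_map and the XML tags below are invented for the example and are not GBIF's schema.

```python
import xml.etree.ElementTree as ET


def build_parent_map(xmltree):
    """Map each element to its parent; ElementTree stores no parent links."""
    return {child: parent for parent in xmltree.iter() for child in parent}


def burrow_up(element, anc_el_tag, parent_map):
    """Walk upward from element until an ancestor with anc_el_tag is found."""
    current = parent_map.get(element)
    while current is not None and current.tag != anc_el_tag:
        current = parent_map.get(current)
    return current  # None if no such ancestor exists


# Made-up records, only to exercise the functions.
xml_text = (
    "<results>"
    "<occurrence><name>Puma concolor</name><lat>37.8</lat></occurrence>"
    "<occurrence><name>Lynx rufus</name><lat>36.5</lat></occurrence>"
    "</results>"
)
tree = ET.ElementTree(ET.fromstring(xml_text))
parents = build_parent_map(tree)

for lat in tree.iter("lat"):
    occ = burrow_up(lat, "occurrence", parents)
    print(occ.find("name").text, lat.text)
```

Building the map costs one pass over the tree, after which each parent lookup is a dictionary access instead of a fresh recursive search from the root.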
>
> # Functions devoted to accessing/downloading GBIF records
>
> def access_gbif(url, params):
>
> # Helper function to access various GBIF services
> #
> # choose the URL ("url") from here:
> # http://data.gbif.org/ws/rest/occurrence
> #
> # params are a dictionary of key/value pairs
> #
> # "_open" is from Bio.Entrez._open, online here:
> # http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#_open
> #
> # Get the handle of results
> # (looks like e.g.: <addinfourl at 75575128 whose fp =
> # <socket._fileobject object at 0x48117f0>> )
>
> # (open with results_handle.read() )
>
>
> def get_hits(params):
>
> Get the actual hits that are returned by a given search
> (this allows parsing & gradual downloading of searches larger
> than e.g. 1000 records)
> It will return the LAST non-none instance (in a standard search
> result there should be only one, anyway).
>
>
> def get_xml_hits(params):
>
> Returns hits like get_hits, but returns a parsed XML tree.
>
>
> def get_all_records_by_increment(params, inc, prefix_fn):
>
> Download all of the records in stages, store in list of elements.
> Increments of e.g. 100 to not overload server
>
> def get_record(key):
>
> Get a single record, return xmltree for it.
>
>
> def get_numhits(params):
>
> Get the number of hits that will be returned by a given search
> (this allows parsing & gradual downloading of searches larger
> than e.g. 1000 records)
> It will return the LAST non-none instance (in a standard search
> result there should be only one, anyway).
>
> def extract_numhits(element):
>
> # Search an element of a parsed XML string and find the
> # number of hits, if it exists. Recursively searches,
> # if there are subelements.
> #
>
> def xmlstring_to_xmltree(xmlstring):
>
> Take the text string returned by GBIF and parse to an XML tree using
> ElementTree.
> Requires the intermediate step of saving to a temporary file
> (required to make ElementTree.parse work, apparently)
>
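On the temporary-file step in xmlstring_to_xmltree: in the standard library's xml.etree.ElementTree, fromstring() parses a string directly, and parse() accepts any file-like object, so an in-memory route may be possible. A minimal sketch for current Python follows; the second function name is made up, and whether this fits the strings GBIF actually returns (encodings, non-ASCII characters) is an assumption that would need checking.

```python
import io
import xml.etree.ElementTree as ET


def xmlstring_to_xmltree(xmlstring):
    """Parse an XML string into an ElementTree without touching the disk."""
    return ET.ElementTree(ET.fromstring(xmlstring))


def xmlstring_to_xmltree_via_filelike(xmlstring):
    """Same result using a file-like object, which ET.parse also accepts."""
    return ET.parse(io.StringIO(xmlstring))


sample = "<a><b>x</b></a>"
tree = xmlstring_to_xmltree(sample)
print(tree.getroot().tag, tree.getroot().find("b").text)
```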
>
>
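The incremental download described for get_all_records_by_increment can be sketched independently of the web service. Here fetch_page is a hypothetical callable standing in for the real request, and the signature differs from the one in the outline (params, inc, prefix_fn) so that the example runs offline.

```python
def get_all_records_by_increment(fetch_page, numhits, inc):
    """Collect numhits records by requesting at most inc records per call.

    fetch_page(start, count) is a hypothetical stand-in for the real
    web-service request; it returns a list of records.
    """
    records = []
    for start in range(0, numhits, inc):
        count = min(inc, numhits - start)  # last page may be shorter
        records.extend(fetch_page(start, count))
    return records


# Stand-in "server" holding 250 fake records.
fake_server = ["record%d" % i for i in range(250)]
calls = []


def fake_fetch(start, count):
    calls.append((start, count))
    return fake_server[start:start + count]


all_records = get_all_records_by_increment(fake_fetch, len(fake_server), 100)
print(len(all_records), calls)
```

With 250 hits and an increment of 100, this issues three requests: two full pages and a final page of 50.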
>
> class TreeSum():
>
> Summary statistics on trees (some of these now redundant with
> Nexus.Tree & will be eliminated).
>
> def read_ultrametric_Newick(newickstr):
>
> Read a Newick file into a tree object (a series of node objects
> linked to parent and daughter nodes), also reading node ages and node
> labels if any.
>
>
> def list_leaves(phylo_obj):
>
> Print out all of the leaves above a node object
>
>
>
> def treelength(node):
>
> Gets the total branchlength above a given node by recursively
> adding through tree.
>
>
> def phylodistance(node1, node2):
>
> Get the phylogenetic distance (branch length) between two nodes.
>
>
> def get_distance_matrix(phylo_obj):
>
> Get a matrix of all of the pairwise distances between the tips of a
> tree.
>
>
>
> def get_mrca_array(phylo_obj):
>
> Get a square list of lists (array) listing the mrca of each pair of
> leaves (half-diagonal matrix)
>
>
>
> def subset_tree(phylo_obj, list_to_keep):
>
> Given a list of tips and a tree, remove all other tips and
> resulting redundant nodes to produce a new smaller tree.
>
>
> def prune_single_desc_nodes(node):
>
> Follow a tree from the bottom up, pruning any nodes with only one
> descendant
>
>
> def find_new_root(node):
>
> Search up tree from root and make new root at first divergence
>
>
> def make_None_list_array(xdim, ydim):
>
> Make a list of lists ("array") with the specified dimensions
>
>
> def get_PD_to_mrca(node, mrca, PD):
>
> Add up the phylogenetic distance from a node to the specified
> ancestor (mrca). Find mrca with find_1st_match.
>
>
>
> def get_ancestors_list(node, anc_list):
>
> Get the list of ancestors of a given node
>
>
>
>
> def addup_PD(node, PD):
>
> Adds the branchlength of the current node to the total PD measure.
>
>
> def print_tree_outline_format(phylo_obj):
>
> Prints the tree out in "outline" format (daughter clades are
> indented, etc.)
>
>
> def print_Node(node, rank):
>
> Prints the node in question, and recursively all daughter nodes,
> maintaining rank as it goes.
>
>
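As a rough illustration of the tree logic behind treelength, get_ancestors_list and phylodistance, here is a sketch on a toy node class. The Node class is invented for the example; it is not the lagrange tree class, Bio.Nexus.Trees, or Eric's proposed Tree module, and get_ancestors_list takes one argument here where the outline lists two.

```python
class Node:
    """Minimal tree node: a branch length plus parent/daughter links."""

    def __init__(self, name, brlen=0.0, parent=None):
        self.name = name
        self.brlen = brlen  # length of the branch leading to this node
        self.parent = parent
        self.daughters = []
        if parent is not None:
            parent.daughters.append(self)


def treelength(node):
    """Sum the lengths of every branch descending from node."""
    return sum(d.brlen + treelength(d) for d in node.daughters)


def get_ancestors_list(node):
    """Return node's ancestors, nearest first, ending at the root."""
    anc_list = []
    while node.parent is not None:
        node = node.parent
        anc_list.append(node)
    return anc_list


def phylodistance(node1, node2):
    """Path length between two nodes, measured through their MRCA."""
    path1 = [node1] + get_ancestors_list(node1)
    path2 = [node2] + get_ancestors_list(node2)
    mrca = next(n for n in path1 if n in path2)
    dist = 0.0
    for path in (path1, path2):
        for n in path:
            if n is mrca:
                break
            dist += n.brlen
    return dist


# Toy ultrametric tree: ((A:1.0,B:1.0)AB:2.0,C:3.0)root;
root = Node("root")
ab = Node("AB", 2.0, root)
a = Node("A", 1.0, ab)
b = Node("B", 1.0, ab)
c = Node("C", 3.0, root)

print(treelength(root))      # 7.0
print(phylodistance(a, b))   # 2.0
print(phylodistance(a, c))   # 6.0
```

The distance is computed by walking both nodes up to their most recent common ancestor and adding the branch lengths passed on the way, which is the same idea as get_PD_to_mrca applied from both sides.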
>
> class Ranges():
>
> Geographic range of a species (collection of points, results
> of classification of those points into regions), GIS-like functions for
> processing them.
>
>
> class Points():
>
> geographic locations of individual collected specimens
>
>
> def readshpfile(fn):
>
> def summarize_shapefile(fn, output_option, outfn):
>
> def point_inside_polygon(x,y,poly):
>
> def shapefile_points_in_poly(pt_records, poly):
>
> def tablefile_points_in_poly(fh, ycol, xcol, namecol, poly):
>
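The outline gives no description for point_inside_polygon(x,y,poly). One standard technique for this kind of test is ray casting, sketched below under the assumption that poly is a list of (x, y) vertex tuples and that coordinates are treated as planar. This may differ from what the actual function does.

```python
def point_inside_polygon(x, y, poly):
    """Ray-casting test: is (x, y) inside poly, a list of (x, y) vertices?

    Counts how many polygon edges a horizontal ray from the point crosses;
    an odd count means the point is inside. Points exactly on an edge are
    not handled specially.
    """
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
print(point_inside_polygon(2.0, 2.0, square))   # True
print(point_inside_polygon(5.0, 2.0, square))   # False
```

For latitude/longitude data, x would be longitude and y latitude; polygons that cross the date line or cover a pole would need extra care that this sketch does not attempt.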
> ==========
>
>
> Here is a summary of the
>
> Nick Matzke wrote:
> > Thanks for the fix!!! A big help. I am currently organizing my
> > functions into several classes and making sure they work, basically the
> > classes look like they will be something like:
> >
> > ==========
> > GbifXml -- for processing GBIF XML results (all of the functions for
> > searching/extracting stuff from xmltree structures)
> >
> > TreeSum -- for processing trees & getting summary statistics etc.
> >
> > Ranges -- Geographic range of a species (collection of points, results
> > of classification of those points into regions), GIS-like functions for
> > processing them
> > Points -- geographic locations of individual collected specimens
> > ==========
> >
> >
> > Brad Chapman wrote:
> >> Hi Nick;
> >> Thanks for the comprehensive update. It sounds like your discussion
> >> with Eric resolved most of the questions about the tree
> >> representation. It's great to see y'all converging on this.
> >>
> >>> It sounds like for my immediate purposes, Bio.Nexus.Trees is the
> >>> solution for now, I will reorganize my code accordingly based on
> >>> this. If/when Bio.Nexus.Trees accepts node labels I will remove a
> >>> function stripping out node labels. Also I have not forgotten
> >>> previous comments from Brad et al. about bringing the other code up
> >>> to specs. So I will update the BioGeography schedule and overall
> >>> organization I hope to have at the end (with classes/methods etc.,
> >>> instead of just a list-o-functions, which is how my original schedule
> >>> was explicitly laid out), and post an update when done.
> >>
> >> Agreed, and seconding Hilmar that the best thing about open source
> >> code is having others looking at your code. Conversely, feel free to
> >> dig in and fix current code where it is holding you up. To remove
> >> this blocking issue on Nexus and get us rolling again, I
> >> put together an initial fix. You can grab the patch from:
> >>
> >> http://bugzilla.open-bio.org/show_bug.cgi?id=2788
> >>
> >> Let us know if this works for your files of interest.
> >>
> >> If this clears up the Nexus issue, it would be great to see the
> >> revised schedule incorporating the refactoring. Sounds like we are
> >> moving in the right direction. Good stuff.
> >>
> >> Thanks,
> >> Brad
> >>
> >
>
> --
> ====================================================
> Nicholas J. Matzke
> Ph.D. Candidate, Graduate Student Researcher
> Huelsenbeck Lab
> Center for Theoretical Evolutionary Genomics
> 4151 VLSB (Valley Life Sciences Building)
> Department of Integrative Biology
> University of California, Berkeley
>
> Lab websites:
> http://ib.berkeley.edu/people/lab_detail.php?lab=54
> http://fisher.berkeley.edu/cteg/hlab.html
> Dept. personal page:
> http://ib.berkeley.edu/people/students/person_detail.php?person=370
> Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
> Lab phone: 510-643-6299
> Dept. fax: 510-643-6264
> Cell phone: 510-301-0179
> Email: matzke at berkeley.edu
>
> Mailing address:
> Department of Integrative Biology
> 3060 VLSB #3140
> Berkeley, CA 94720-3140
>
> -----------------------------------------------------
> "[W]hen people thought the earth was flat, they were wrong. When people
> thought the earth was spherical, they were wrong. But if you think that
> thinking the earth is spherical is just as wrong as thinking the earth
> is flat, then your view is wronger than both of them put together."
>
> Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
> 14(1), 35-44. Fall 1989.
> http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
> ====================================================