[Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion
Nick Matzke
matzke at berkeley.edu
Mon Aug 10 16:23:15 EDT 2009
Hi all...updates...
Summary: Major focus is getting the GBIF access/search/parse module into
"done"/submittable shape. This primarily requires getting the
documentation and testing up to Biopython specs. I have a fair bit of
documentation and testing, but need advice (see below) on the specifics of
what it should look like.
Brad Chapman wrote:
> Hi Nick;
> Thanks for the update -- great to see things moving along.
>
>> - removed any reliance on lagrange tree module, refactored all phylogeny
>> code to use the revised Bio.Nexus.Tree module
>
> Awesome -- glad this worked for you. Are the lagrange_* files in
> Bio.Geography still necessary? If not, we should remove them from
> the repository to clean things up.
Ah, they had been deleted locally but it took an extra command to delete
on git. Done.
>
> More generally, it would be really helpful if we could do a bit of
> housekeeping on the repository. The Geography namespace has a lot of
> things in it which belong in different parts of the tree:
>
> - The test code should move to the 'Tests' directory as a set of
> test_Geography* files that we can use for unit testing the code.
OK, I will do this. Should I try and figure out the unittest stuff? I
could use a simple example of what this is supposed to look like.
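For reference, a bare-bones unittest file generally looks something like
the sketch below. ObservationStub is a hypothetical stand-in so that the
example runs without the Geography code; a real Tests/test_Geography*.py
file would import the actual classes instead, and the exact layout
Biopython expects may differ.

```python
import unittest


class ObservationStub:
    """Hypothetical stand-in for a record class (not the real module code)."""

    def __init__(self, species, lat, lon):
        self.species = species
        self.lat = float(lat)
        self.lon = float(lon)

    def record_to_string(self):
        # Tab-delimited output, mirroring the idea of record_to_string().
        return "\t".join([self.species, str(self.lat), str(self.lon)])


class ObservationStubTests(unittest.TestCase):
    def setUp(self):
        # Runs before every test method, so each test gets a fresh record.
        self.rec = ObservationStub("Utricularia gibba", "37.87", "-122.26")

    def test_coordinates_are_floats(self):
        self.assertIsInstance(self.rec.lat, float)
        self.assertAlmostEqual(self.rec.lon, -122.26)

    def test_record_to_string_is_tab_delimited(self):
        fields = self.rec.record_to_string().split("\t")
        self.assertEqual(fields[0], "Utricularia gibba")
        self.assertEqual(len(fields), 3)


if __name__ == "__main__":
    # exit=False keeps the interpreter alive after the run.
    unittest.main(argv=["test_Geography_sketch"], exit=False)
```

Each test_* method is discovered and run independently; data files would
be read from a path relative to the Tests directory rather than built
inline as here.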
> - Similarly there are a lot of data files in there which
> appear to be test related; these could move to Tests/Geography
Will do.
> - What is happening with the Nodes_v2 and Treesv2 files? They look
> like duplicates of the Nexus Nodes and Trees with some changes.
> Could we roll those changes into the main Nexus code to avoid
> duplication?
Yeah, these were just copies with your bug fix, and with a few mods I
used to track crashes. Presumably I don't need these after a fresh
download of biopython.
>> - Code dealing with GBIF xml output completely refactored into the
>> following classes:
>>
>> * ObsRecs (observation records & search results/summary)
>> * ObsRec (an individual observation record)
>> * XmlString (functions for cleaning xml returned by Gbif)
>> * GbifXml (extension of capabilities for ElementTree xml trees, parsed
>> from GBIF xml returns)
>
> I agree with Hilmar -- the user classes would probably benefit from expanded
> naming. There is an art to naming to get them somewhere between the hideous
> RidiculouslyLongNamesWithEverythingSpecified names and short truncated names.
> Specifically, you've got a lot of filler in the names -- dbfUtils,
> geogUtils, shpUtils. The Utils probably doesn't tell the user much
> and makes all of the names sort of blend together, just as the Rec/Recs
> pluralization hides a quite large difference in what the classes hold.
Will work on this. These should be made part of the
GbifObservationRecord() object or be accessed by it; basically they only
exist to classify lat/long points into user-specified areas.
> Something like Observation and ObservationSearchResult would make it
> clear immediately what they do and the information they hold.
Agreed, here is a new scheme for the names (changes already made):
=============
class GbifSearchResults():
GbifSearchResults is a class for holding a series of
GbifObservationRecord records, and processing them e.g. into classified
areas.
Also can hold a GbifDarwincoreXmlString record (the raw output returned
from a GBIF search) and a GbifXmlTree (a class for holding/processing
the ElementTree object returned by parsing the GbifDarwincoreXmlString).
class GbifObservationRecord():
GbifObservationRecord is a class for holding an individual observation
at an individual lat/long point.
class GbifDarwincoreXmlString(str):
GbifDarwincoreXmlString is a class for holding the xmlstring returned by
a GBIF search, & processing it to plain text, then an xmltree (an
ElementTree).
GbifDarwincoreXmlString inherits string methods from str (class String).
class GbifXmlTree():
GbifXmlTree is a class for holding and processing xmltrees of GBIF records.
=============
...description of methods below...
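As a rough skeleton of how the four classes in the scheme above relate to
each other (a sketch only: the attribute names, the to_xmltree() helper,
and the constructor arguments are hypothetical, not taken from the actual
module):

```python
import xml.etree.ElementTree as ET


class GbifObservationRecord:
    """One observation at a single lat/long point."""

    def __init__(self, species=None, lat=None, lon=None):
        self.species = species
        self.lat = lat
        self.lon = lon


class GbifXmlTree:
    """Wrapper around an ElementTree parsed from GBIF XML."""

    def __init__(self, xmltree=None):
        self.xmltree = xmltree


class GbifDarwincoreXmlString(str):
    """Raw XML text returned by a GBIF search; inherits all str methods."""

    def to_xmltree(self):
        # Hypothetical helper: parse the in-memory string into a tree wrapper.
        root = ET.fromstring(str(self))
        return GbifXmlTree(ET.ElementTree(root))


class GbifSearchResults:
    """Holds the raw XML string, the parsed tree, and the list of records."""

    def __init__(self, xmlstring=None):
        self.xmlstring = None
        self.xmltree = None
        self.obs_recs_list = []  # would hold GbifObservationRecord objects
        if xmlstring is not None:
            self.xmlstring = GbifDarwincoreXmlString(xmlstring)
            self.xmltree = self.xmlstring.to_xmltree()


results = GbifSearchResults("<gbifResponse><TaxonOccurrence/></gbifResponse>")
print(results.xmltree.xmltree.getroot().tag)
```

Because GbifDarwincoreXmlString subclasses str, ordinary string methods
(upper(), replace(), slicing) keep working on the raw search output.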
>
>> This week:
>
> What are your thoughts on documentation? As a naive user of these
> tools without much experience with the formats, I could offer better
> feedback if I had an idea of the public APIs and how they are
> expected to be used. Moreover, cookbook and API documentation is something
> we will definitely need to integrate into Biopython. How does this fit
> in your timeline for the remaining weeks?
The API is really just the interface with GBIF. I think developing a
cookbook entry is pretty easy, I assume you want something like one of
the entries in the official biopython cookbook?
Re: API documentation...are you just talking about the function
descriptions that are typically in """ """ strings beneath the function
definitions? I've got that done. Again, if there is more, an example
of what it should look like would be useful.
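As a concrete example of that docstring style, here is a sketch of a
function documented in a """ """ string, with doctest examples that double
as a quick test. classify_latlong() and its arguments are hypothetical,
chosen only to illustrate classifying a lat/long point into a
user-specified rectangular area.

```python
import doctest


def classify_latlong(lat, lon, south, north, west, east):
    """Return True if the point (lat, lon) lies inside a rectangular area.

    Arguments:
     - lat, lon: decimal degrees of the observation.
     - south, north: latitude bounds of the area (inclusive).
     - west, east: longitude bounds of the area (inclusive).

    The area is assumed not to cross the 180th meridian.

    >>> classify_latlong(37.87, -122.26, 32.0, 42.0, -125.0, -114.0)
    True
    >>> classify_latlong(51.5, -0.1, 32.0, 42.0, -125.0, -114.0)
    False
    """
    return south <= lat <= north and west <= lon <= east


if __name__ == "__main__":
    # Runs the >>> examples embedded in the docstrings of this module.
    doctest.testmod()
```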
Documentation for the GBIF stuff below.
============
gbif_xml.py
Functions for accessing GBIF, downloading records, processing them into
a class, and extracting information from the xmltree in that class.
class GbifObservationRecordError(Exception): pass
class GbifObservationRecord():
GbifObservationRecord is a class for holding an individual observation
at an individual lat/long point.
__init__(self):
Constructor for setting up new objects of this class.
latlong_to_obj(self, line):
Read in a string, read species/lat/long to GbifObservationRecord object
This can be slow, e.g. 10 seconds for even just ~1000 records.
parse_occurrence_element(self, element):
Parse a TaxonOccurrence element, store in this GbifObservationRecord
fill_occ_attribute(self, element, el_tag, format='str'):
Return the text (matching_el.text) of the first matching element.
find_1st_matching_subelement(self, element, el_tag, return_element):
Burrow down into the XML tree, retrieve the first element with the
matching tag.
record_to_string(self):
Print the attributes of a record to a string
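The GbifObservationRecord parsing methods above could be sketched roughly
as follows. This is a standalone illustration, not the module's code: the
functions are free functions rather than methods, the sample XML is
invented, and its tag names are simplified (real GBIF returns are
namespaced, so the tags would need their namespace prefixes).

```python
import xml.etree.ElementTree as ET

# Invented, simplified sample of one occurrence element.
SAMPLE = """<TaxonOccurrence>
  <identifiedTo>
    <Identification><taxonName>Utricularia gibba</taxonName></Identification>
  </identifiedTo>
  <decimalLatitude>37.87</decimalLatitude>
  <decimalLongitude>-122.26</decimalLongitude>
</TaxonOccurrence>"""


def find_first_matching_subelement(element, el_tag):
    """Return the first element (in document order) with tag el_tag, or None."""
    for candidate in element.iter(el_tag):
        return candidate
    return None


def fill_occ_attribute(element, el_tag, convert=str):
    """Return the converted text of the first match, or None if unavailable."""
    match = find_first_matching_subelement(element, el_tag)
    if match is None or match.text is None:
        return None
    try:
        return convert(match.text.strip())
    except ValueError:
        # e.g. a latitude field that does not hold a number
        return None


occurrence = ET.fromstring(SAMPLE)
record = {
    "species": fill_occ_attribute(occurrence, "taxonName"),
    "lat": fill_occ_attribute(occurrence, "decimalLatitude", float),
    "lon": fill_occ_attribute(occurrence, "decimalLongitude", float),
}
print(record)
```

Element.iter() walks the subtree depth-first, so no hand-written recursion
is needed to burrow down to the first match.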
class GbifDarwincoreXmlStringError(Exception): pass
class GbifDarwincoreXmlString(str):
GbifDarwincoreXmlString is a class for holding the xmlstring returned by
a GBIF search, & processing it to plain text, then an xmltree (an
ElementTree).
GbifDarwincoreXmlString inherits string methods from str (class String).
__init__(self, rawstring=None):
Constructor for setting up new objects of this class.
fix_ASCII_lines(self, endline=''):
Convert each line in an input string into pure ASCII
(This avoids crashes when printing to screen, etc.)
_fix_ASCII_line(self, line):
Convert a single string line into pure ASCII
(This avoids crashes when printing to screen, etc.)
_unescape(self, text):
Removes HTML or XML character references and entities from a text string.
@param text The HTML (or XML) source text.
@return The plain text, as a Unicode string, if necessary.
source: http://effbot.org/zone/re-sub.htm#unescape-html
_fix_ampersand(self, line):
Replaces "&amp;" with "&" in a string; this is otherwise
not caught by the unescape and unicodedata.normalize functions.
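The cleaning steps listed for GbifDarwincoreXmlString can be sketched with
the standard library alone. This is a Python 3 approximation under
assumptions, not the module's code: it uses html.unescape() in place of the
effbot recipe, and the function names and the endline default are
illustrative.

```python
import html
import unicodedata


def unescape(text):
    """Replace HTML/XML character references and entities with characters."""
    return html.unescape(text)


def fix_ascii_line(line):
    """Unescape a line, then reduce it to pure ASCII.

    NFKD normalization splits accented characters into a base letter plus
    combining marks; encoding with errors="ignore" then drops the marks and
    any other character that has no ASCII form.
    """
    decomposed = unicodedata.normalize("NFKD", unescape(line))
    return decomposed.encode("ascii", "ignore").decode("ascii")


def fix_ascii_lines(text, endline="\n"):
    """Apply fix_ascii_line to every line and rejoin with endline."""
    return endline.join(fix_ascii_line(line) for line in text.splitlines())


print(fix_ascii_line("Pe&ntilde;a &amp; Garc\u00eda"))
```

One caveat: unescaping "&amp;" or "&lt;" yields text that is safe to print
but may no longer be well-formed XML, so this kind of cleaning suits
display strings better than input that will be parsed afterwards.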
class GbifXmlTreeError(Exception): pass
class GbifXmlTree():
GbifXmlTree is a class for holding and processing xmltrees of GBIF records.
__init__(self, xmltree=None):
Constructor for setting up new objects of this class.
print_xmltree(self):
Prints all the elements & subelements of the xmltree to screen (may require
fix_ASCII to input file to succeed)
print_subelements(self, element):
Takes an element from an XML tree and prints the subelements tag & text, and
the within-tag items (key/value or whatnot)
_element_items_to_dictionary(self, element_items):
If the XML tree element has items encoded in the tag, e.g. key/value or
whatever, this function puts them in a python dictionary and returns
them.
extract_latlongs(self, element):
Create a temporary pseudofile, extract lat longs to it,
return results as string.
Inspired by: http://www.skymind.com/~ocrow/python_string/
(Method 5: Write to a pseudo file)
_extract_latlong_datum(self, element, file_str):
Searches an element in an XML tree for lat/long information, and the
complete name. Searches recursively, if there are subelements.
file_str is a string created by StringIO in extract_latlongs() (i.e., a
temp filestr)
extract_all_matching_elements(self, start_element, el_to_match):
Returns a list of the elements, picking elements by TaxonOccurrence;
this should return a list of elements equal to the number of hits.
_recursive_el_match(self, element, el_to_match, output_list):
Search recursively through xmltree, starting with element, recording all
instances of el_to_match.
find_to_elements_w_ancs(self, el_tag, anc_el_tag):
Burrow into XML to get an element with tag el_tag, return only those
el_tags underneath a particular ancestor element anc_el_tag
xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag,
match_el_list):
Recursively burrows down to find whatever elements with el_tag exist
inside an anc_el_tag.
create_sub_xmltree(self, element):
Create a subset xmltree (to avoid going back to irrelevant parents)
_xml_burrow_up(self, element, anc_el_tag, found_anc):
Burrow up xml to find anc_el_tag
_xml_burrow_up_cousin(element, cousin_el_tag, found_cousin):
Burrow up from element of interest, until a cousin is found with
cousin_el_tag
_return_parent_in_xmltree(self, child_to_search_for):
Search through an xmltree to get the parent of child_to_search_for
_return_parent_in_element(self, potential_parent, child_to_search_for,
returned_parent):
Search through an XML element to return parent of child_to_search_for
find_1st_matching_element(self, element, el_tag, return_element):
Burrow down into the XML tree, retrieve the first element with the
matching tag
extract_numhits(self, element):
Search an element of a parsed XML string and find the
number of hits, if it exists. Recursively searches,
if there are subelements.
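A compact sketch of the tree-walking ideas in GbifXmlTree: recursive
matching, "burrowing up" to ancestors, and finding the hit count. The
sample XML, its un-namespaced tags, and the totalMatched attribute are
assumptions for illustration, and the functions are standalone rather than
methods. ElementTree elements do not store a pointer to their parent, so
the sketch builds a child-to-parent dictionary once instead of searching
the tree for each parent.

```python
import xml.etree.ElementTree as ET

# Invented, simplified sample of a search response.
SAMPLE = """<gbifResponse>
  <summary totalMatched="2"/>
  <dataProvider>
    <name>Provider A</name>
    <TaxonOccurrence gbifKey="1"><name>rec1</name></TaxonOccurrence>
    <TaxonOccurrence gbifKey="2"><name>rec2</name></TaxonOccurrence>
  </dataProvider>
</gbifResponse>"""


def recursive_el_match(element, el_to_match, output_list):
    """Append every element with tag el_to_match, searching recursively."""
    if element.tag == el_to_match:
        output_list.append(element)
    for child in element:
        recursive_el_match(child, el_to_match, output_list)
    return output_list


def build_parent_map(root):
    """Map each child element to its parent."""
    return {child: parent for parent in root.iter() for child in parent}


def find_elements_with_ancestor(root, el_tag, anc_el_tag):
    """Return el_tag elements that sit somewhere below an anc_el_tag element."""
    parents = build_parent_map(root)
    matches = []
    for el in root.iter(el_tag):
        anc = parents.get(el)
        while anc is not None:  # burrow up until the root is passed
            if anc.tag == anc_el_tag:
                matches.append(el)
                break
            anc = parents.get(anc)
    return matches


def extract_numhits(root):
    """Return the last totalMatched value found in the tree, or None."""
    numhits = None
    for el in root.iter():
        value = el.get("totalMatched")
        if value is not None:
            numhits = int(value)
    return numhits


root = ET.fromstring(SAMPLE)
occurrences = recursive_el_match(root, "TaxonOccurrence", [])
names = [el.text for el in find_elements_with_ancestor(root, "name", "TaxonOccurrence")]
print(len(occurrences), names, extract_numhits(root))
```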
class GbifSearchResultsError(Exception): pass
class GbifSearchResults():
GbifSearchResults is a class for holding a series of
GbifObservationRecord records, and processing them e.g. into classified
areas.
__init__(self, gbif_recs_xmltree=None):
Constructor for setting up new objects of this class.
print_records(self):
Print all records in tab-delimited format to screen.
print_records_to_file(self, fn):
Print the attributes of a record to a file with filename fn
latlongs_to_obj(self):
Takes the string from extract_latlongs, puts each line into a
GbifObservationRecord object.
Return a list of the objects
Functions devoted to accessing/downloading GBIF records
access_gbif(self, url, params):
Helper function to access various GBIF services
choose the URL ("url") from here:
http://data.gbif.org/ws/rest/occurrence
params are a dictionary of key/value pairs
"self._open" is adapted from Bio.Entrez._open, online here:
http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#self._open
Get the handle of results
(looks like e.g.: <addinfourl at 75575128 whose fp = <socket._fileobject
object at 0x48117f0>> )
(open with results_handle.read() )
_get_hits(self, params):
Get the actual hits that are returned by a given search
(this allows parsing & gradual downloading of searches larger
than e.g. 1000 records)
It will return the LAST non-none instance (in a standard search result there
should be only one, anyway).
get_xml_hits(self, params):
Returns hits like _get_hits, but returns a parsed XML tree.
get_record(self, key):
Given the key, get a single record, return xmltree for it.
get_numhits(self, params):
Get the number of hits that will be returned by a given search
(this allows parsing & gradual downloading of searches larger
than e.g. 1000 records)
It will return the LAST non-none instance (in a standard search result there
should be only one, anyway).
xmlstring_to_xmltree(self, xmlstring):
Take the text string returned by GBIF and parse to an XML tree using
ElementTree.
Requires the intermediate step of saving to a temporary file (required
to make ElementTree.parse work, apparently)
tempfn = 'tempxml.xml'
fh = open(tempfn, 'w')
fh.write(xmlstring)
fh.close()
get_all_records_by_increment(self, params, inc):
Download all of the records in stages, store in list of elements.
Increments of e.g. 100 to not overload server
extract_occurrences_from_gbif_xmltree_list(self, gbif_xmltree):
Extract all of the 'TaxonOccurrence' elements to a list, store them in a
GbifObservationRecord.
_paramsdict_to_string(self, params):
Converts the python dictionary of search parameters into a text
string for submission to GBIF
_open(self, cgi, params={}):
Function for accessing online databases.
Modified from:
http://www.biopython.org/DIST/docs/api/Bio.Entrez-module.html
Helper function to build the URL and open a handle to it (PRIVATE).
Open a handle to GBIF. cgi is the URL for the cgi script to access.
params is a dictionary with the options to pass to it. Does some
simple error checking, and will raise an IOError if it encounters one.
This function also enforces the "three second rule" to avoid abusing
the GBIF servers (modified after NCBI requirement).
============
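The access pattern described for GbifSearchResults (encode the parameters,
wait between requests, download in increments) can be sketched without
touching the network. Everything here is illustrative: ThrottledFetcher and
the injected fetch callable are hypothetical, and the startindex/maxresults
parameter names are assumptions about the GBIF service rather than
something confirmed above. In real use, fetch would open the URL and return
the response text.

```python
import time
from urllib.parse import urlencode


def paramsdict_to_string(params):
    """Encode search parameters as a query string (sorted for stable output)."""
    return urlencode(sorted(params.items()))


class ThrottledFetcher:
    """Wrap a fetch callable so calls are at least min_delay seconds apart."""

    def __init__(self, fetch, min_delay=3.0, clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch
        self._min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def __call__(self, url, params):
        if self._last_call is not None:
            wait = self._min_delay - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)  # the "three second rule"
        self._last_call = self._clock()
        return self._fetch(url + "?" + paramsdict_to_string(params))


def get_all_records_by_increment(fetch, url, params, numhits, inc):
    """Request numhits records in pages of inc; return the list of responses."""
    pages = []
    for start in range(0, numhits, inc):
        # Assumed paging parameter names; copy params so the caller's dict
        # is left untouched.
        page_params = dict(params, startindex=start, maxresults=inc)
        pages.append(fetch(url, page_params))
    return pages


# Demo with a fake fetch that just echoes the URL it was asked for.
demo_fetch = ThrottledFetcher(lambda full_url: full_url, min_delay=0.0)
demo_pages = get_all_records_by_increment(
    demo_fetch, "http://example.invalid/occurrence/list",
    {"scientificname": "Utricularia"}, numhits=250, inc=100)
print(len(demo_pages))
```

Passing the clock and sleep functions in as arguments keeps the throttling
testable without real waiting, which matters once this moves into the unit
tests.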
>
> Thanks again. Hope this helps,
> Brad
Very much, thanks!!
Nick
--
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley
Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page:
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu
Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140
-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people
thought the earth was spherical, they were wrong. But if you think that
thinking the earth is spherical is just as wrong as thinking the earth
is flat, then your view is wronger than both of them put together."
Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================