[Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion

Mon Aug 10 20:23:15 UTC 2009

Hi all...updates...

Summary: Major focus is getting the GBIF access/search/parse module into 
"done"/submittable shape.  This primarily requires getting the 
documentation and testing up to biopython specs.  I have a fair bit of 
documentation and testing, but need advice (see below) on the specifics 
of what it should look like.

Brad Chapman wrote:
> Hi Nick;
> Thanks for the update -- great to see things moving along.
> 
>> - removed any reliance on lagrange tree module, refactored all phylogeny 
>> code to use the revised Bio.Nexus.Tree module
> 
> Awesome -- glad this worked for you. Are the lagrange_* files in
> Bio.Geography still necessary? If not, we should remove them from
> the repository to clean things up.

Ah, they had been deleted locally but it took an extra command to delete 
on git.  Done.

> 
> More generally, it would be really helpful if we could do a bit of
> housekeeping on the repository. The Geography namespace has a lot of
> things in it which belong in different parts of the tree:
> 
> - The test code should move to the 'Tests' directory as a set of
>   test_Geography* files that we can use for unit testing the code.

OK, I will do this.  Should I try and figure out the unittest stuff?  I 
could use a simple example of what this is supposed to look like.
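
(For reference, a minimal sketch of what a unittest-style file might look like, assuming a name along the lines of Tests/test_Geography.py. The classify_point helper is a hypothetical stand-in defined inline so the sketch runs by itself; it is not part of Bio.Geography.)

```python
import unittest


def classify_point(lat, lon, areas):
    """Return the name of the first bounding box containing (lat, lon).

    Hypothetical stand-in for the lat/long classification code;
    areas maps a name to (min_lat, max_lat, min_lon, max_lon).
    """
    for name, (min_lat, max_lat, min_lon, max_lon) in sorted(areas.items()):
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return name
    return None


class ClassifyPointTest(unittest.TestCase):
    """Each method whose name starts with test_ is found and run by unittest."""

    def setUp(self):
        # Runs before every test method
        self.areas = {"north": (0.0, 90.0, -180.0, 180.0),
                      "south": (-90.0, -0.0001, -180.0, 180.0)}

    def test_point_inside_area(self):
        self.assertEqual(classify_point(37.87, -122.27, self.areas), "north")

    def test_point_outside_all_areas(self):
        self.assertIsNone(classify_point(95.0, 0.0, self.areas))
```

Saved as a test_*.py file, something like this can be run with "python -m unittest" from the directory that contains it.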

> - Similarly there are a lot of data files in there which
>   appear to be test related; these could move to Tests/Geography

Will do.

> - What is happening with the Nodes_v2 and Treesv2 files? They look
>   like duplicates of the Nexus Nodes and Trees with some changes.
>   Could we roll those changes into the main Nexus code to avoid
>   duplication?

Yeah, these were just copies with your bug fix, and with a few mods I 
used to track crashes.  Presumably I don't need these after a fresh 
download of biopython.

>> - Code dealing with GBIF xml output completely refactored into the 
>> following classes:
>>
>> * ObsRecs (observation records & search results/summary)
>> * ObsRec (an individual observation record)
>> * XmlString (functions for cleaning xml returned by Gbif)
>> * GbifXml (extension of capabilities for ElementTree xml trees, parsed 
>> from GBIF xml returns)
> 
> I agree with Hilmar -- the user classes would probably benefit from expanded
> naming. There is an art to naming to get them somewhere between the hideous 
> RidiculouslyLongNamesWithEverythingSpecified names and short truncated names.
> Specifically, you've got a lot of filler in the names -- dbfUtils,
> geogUtils, shpUtils. The Utils probably doesn't tell the user much
> and makes all of the names sort of blend together, just as the Rec/Recs 
> pluralization hides a quite large difference in what the classes hold.

Will work on this.  These should be made part of the 
GbifObservationRecord() object or be accessed by it; basically they only 
exist to classify lat/long points into user-specified areas.

> Something like Observation and ObservationSearchResult would make it
> clear immediately what they do and the information they hold.

Agreed, here is a new scheme for the names (changes already made):

=============
class GbifSearchResults():	

GbifSearchResults is a class for holding a series of 
GbifObservationRecord records, and processing them e.g. into classified 
areas.

Also can hold a GbifDarwincoreXmlString record (the raw output returned 
from a GBIF search) and a GbifXmlTree (a class for holding/processing 
the ElementTree object returned by parsing the GbifDarwincoreXmlString).

class GbifObservationRecord():

GbifObservationRecord is a class for holding an individual observation 
at an individual lat/long point.

class GbifDarwincoreXmlString(str):

GbifDarwincoreXmlString is a class for holding the xmlstring returned by 
a GBIF search, & processing it to plain text, then an xmltree (an 
ElementTree).

GbifDarwincoreXmlString inherits string methods from str (class String).

class GbifXmlTree():
GbifXmlTree is a class for holding and processing xmltrees of GBIF records.
=============

...description of methods below...

> 
>> This week:
> 
> What are your thoughts on documentation? As a naive user of these
> tools without much experience with the formats, I could offer better
> feedback if I had an idea of the public APIs and how they are
> expected to be used. Moreover, cookbook and API documentation is something 
> we will definitely need to integrate into Biopython. How does this fit 
> in your timeline for the remaining weeks?

The API is really just the interface with GBIF.  I think developing a 
cookbook entry is pretty easy, I assume you want something like one of 
the entries in the official biopython cookbook?

Re: API documentation...are you just talking about the function 
descriptions that are typically in """ """ strings beneath the function 
definitions?  I've got that done.  Again, if there is more, an example 
of what it should look like would be useful.

Documentation for the GBIF stuff below.

============
gbif_xml.py
Functions for accessing GBIF, downloading records, processing them into 
a class, and extracting information from the xmltree in that class.

class GbifObservationRecordError(Exception): pass
class GbifObservationRecord():
GbifObservationRecord is a class for holding an individual observation 
at an individual lat/long point.

__init__(self):

Constructor; initializes a new object of this class.

latlong_to_obj(self, line):

Read in a string and store its species/lat/long values in a 
GbifObservationRecord object.
This can be slow, e.g. 10 seconds for even just ~1000 records.

parse_occurrence_element(self, element):

Parse a TaxonOccurrence element and store it in this GbifObservationRecord.

fill_occ_attribute(self, element, el_tag, format='str'):

Return the text of the matching element (matching_el.text).

find_1st_matching_subelement(self, element, el_tag, return_element):

Burrow down into the XML tree, retrieve the first element with the 
matching tag.

record_to_string(self):

Print the attributes of a record to a string

class GbifDarwincoreXmlStringError(Exception): pass

class GbifDarwincoreXmlString(str):
GbifDarwincoreXmlString is a class for holding the xmlstring returned by 
a GBIF search, & processing it to plain text, then an xmltree (an 
ElementTree).

GbifDarwincoreXmlString inherits string methods from str (class String).

__init__(self, rawstring=None):

Constructor; initializes a new object of this class.

fix_ASCII_lines(self, endline=''):

Convert each line in an input string into pure ASCII
(This avoids crashes when printing to screen, etc.)

_fix_ASCII_line(self, line):

Convert a single string line into pure ASCII
(This avoids crashes when printing to screen, etc.)
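
A minimal sketch of one common way to force a line into pure ASCII, assuming Python 3 and the unicodedata.normalize route mentioned further down. The function name to_ascii is hypothetical, and the actual _fix_ASCII_line may well differ in its details.

```python
import unicodedata


def to_ascii(line):
    """Return line with accents stripped and other non-ASCII characters dropped."""
    # NFKD decomposition splits an accented letter into base letter + combining mark
    decomposed = unicodedata.normalize("NFKD", line)
    # Encoding with errors="ignore" then drops everything outside ASCII
    return decomposed.encode("ascii", "ignore").decode("ascii")


cleaned = to_ascii("Cañon")
```

Note that this is lossy: characters with no ASCII base letter simply disappear, which may matter for collector or locality names.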

_unescape(self, text):

Removes HTML or XML character references and entities from a text string.

@param text The HTML (or XML) source text.
@return The plain text, as a Unicode string, if necessary.
source: http://effbot.org/zone/re-sub.htm#unescape-html
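
As a point of comparison only: on Python 3.4 or later the standard library's html.unescape covers much the same ground as the effbot recipe. This is a sketch under that assumption, not the code used in _unescape.

```python
import html

raw = "Smith &amp; Jones, Caf&#233; &lt;Berkeley&gt;"
# Named entities and numeric character references both become plain characters
plain = html.unescape(raw)
```

One caution on the design: unescaping &amp; or &lt; before the text is parsed as XML can leave it no longer well-formed, so the order of the cleaning steps matters.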

_fix_ampersand(self, line):

Replaces "&" with "&amp;" in a string; this is otherwise
not caught by the unescape and unicodedata.normalize functions.
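
A minimal sketch of the ampersand fix, assuming Python's re module. The guard that leaves existing entities alone is an assumption added for this example; whether _fix_ampersand does the same is not stated above.

```python
import re

# Match "&" unless it already starts an entity such as &amp; &#233; or &#xE9;
_BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def fix_ampersand(line):
    """Escape bare ampersands so the line can be parsed as XML."""
    return _BARE_AMP.sub("&amp;", line)


fixed = fix_ampersand("Smith & Jones")
```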

class GbifXmlTreeError(Exception): pass
class GbifXmlTree():
GbifXmlTree is a class for holding and processing xmltrees of GBIF records.

__init__(self, xmltree=None):

Constructor; initializes a new object of this class.

print_xmltree(self):

Prints all the elements & subelements of the xmltree to screen (may require
running fix_ASCII_lines on the input first to succeed)

print_subelements(self, element):

Takes an element from an XML tree and prints each subelement's tag & text,
along with any items stored inside the tag (key/value attributes)

_element_items_to_dictionary(self, element_items):

If the XML tree element has items encoded in the tag, e.g. key/value or
whatever, this function puts them in a python dictionary and returns
them.
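
For illustration, ElementTree already exposes these within-tag items as (key, value) pairs, so the conversion can be a one-liner. The element and attribute names below are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical element; the attribute names are not taken from real GBIF output
element = ET.fromstring('<occurrence gbifKey="123" status="ok"/>')

# Element.items() returns the attributes as (key, value) tuples
attributes = dict(element.items())
```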

extract_latlongs(self, element):

Create a temporary pseudofile, extract lat longs to it,
return results as string.

Inspired by: http://www.skymind.com/~ocrow/python_string/
(Method 5: Write to a pseudo file)
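
A rough sketch of the pseudo-file approach on Python 3, where io.StringIO plays the role of the temporary file. The sample XML is a simplified assumption: the subelement names and the lack of namespaces are for illustration and do not reproduce real GBIF output.

```python
import io
import xml.etree.ElementTree as ET

# Simplified, namespace-free stand-in for a GBIF response (assumed layout)
SAMPLE_XML = """<results>
  <TaxonOccurrence>
    <nameComplete>Puma concolor</nameComplete>
    <decimalLatitude>37.87</decimalLatitude>
    <decimalLongitude>-122.27</decimalLongitude>
  </TaxonOccurrence>
  <TaxonOccurrence>
    <nameComplete>Lynx rufus</nameComplete>
    <decimalLatitude>34.05</decimalLatitude>
    <decimalLongitude>-118.24</decimalLongitude>
  </TaxonOccurrence>
</results>"""


def extract_latlongs(root):
    """Write one tab-delimited line per occurrence to an in-memory file."""
    file_str = io.StringIO()
    for occurrence in root.iter("TaxonOccurrence"):
        name = occurrence.findtext("nameComplete", default="")
        lat = occurrence.findtext("decimalLatitude", default="")
        lon = occurrence.findtext("decimalLongitude", default="")
        file_str.write("\t".join([name, lat, lon]) + "\n")
    # getvalue() returns everything written so far as a single string
    return file_str.getvalue()


latlongs = extract_latlongs(ET.fromstring(SAMPLE_XML))
```

Writing to the in-memory file avoids building the result by repeated string concatenation, which is the point of the recipe cited above.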

_extract_latlong_datum(self, element, file_str):

Searches an element in an XML tree for lat/long information, and the
complete name. Searches recursively, if there are subelements.

file_str is a string created by StringIO in extract_latlongs() (i.e., a 
temp filestr)

extract_all_matching_elements(self, start_element, el_to_match):

Returns a list of the elements whose tag matches el_to_match (e.g. 
TaxonOccurrence); the length of this list should equal the number of hits.

_recursive_el_match(self, element, el_to_match, output_list):

Search recursively through xmltree, starting with element, recording all 
instances of el_to_match.
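
A minimal sketch of the recursive search, written as a plain function rather than a method. The sample XML layout is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed layout: occurrences nested at different depths
SAMPLE_XML = (
    "<results>"
    "<provider><TaxonOccurrence id='1'/></provider>"
    "<provider><resource><TaxonOccurrence id='2'/></resource></provider>"
    "</results>"
)


def recursive_el_match(element, el_to_match, output_list):
    """Append element and any descendants whose tag equals el_to_match."""
    if element.tag == el_to_match:
        output_list.append(element)
    for child in element:  # iterating an Element yields its direct children
        recursive_el_match(child, el_to_match, output_list)
    return output_list


root = ET.fromstring(SAMPLE_XML)
matches = recursive_el_match(root, "TaxonOccurrence", [])
```

ElementTree's built-in root.iter("TaxonOccurrence") walks the tree in the same document order, so it may be able to replace the hand-written recursion.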

find_to_elements_w_ancs(self, el_tag, anc_el_tag):

Burrow into XML to get elements with tag el_tag, returning only those 
underneath a particular ancestor element anc_el_tag

xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag, 
match_el_list):

Recursively burrows down to find whatever elements with el_tag exist 
inside an ancestor element anc_el_tag.

create_sub_xmltree(self, element):

Create a subset xmltree (to avoid going back to irrelevant parents)

_xml_burrow_up(self, element, anc_el_tag, found_anc):

Burrow up xml to find anc_el_tag

_xml_burrow_up_cousin(element, cousin_el_tag, found_cousin):

Burrow up from element of interest, until a cousin is found with 
cousin_el_tag

_return_parent_in_xmltree(self, child_to_search_for):

Search through an xmltree to get the parent of child_to_search_for

_return_parent_in_element(self, potential_parent, child_to_search_for, 
returned_parent):

Search through an XML element to return parent of child_to_search_for
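
ElementTree elements do not keep a reference to their parent, which is why the parent has to be searched for. A sketch of one alternative, assuming the tree is not modified between lookups: build a child-to-parent map once and walk it upward. The function names here are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """Map every element below root to its parent."""
    return {child: parent for parent in root.iter() for child in parent}


def burrow_up(element, anc_el_tag, parent_map):
    """Return the nearest ancestor with tag anc_el_tag, or None if there is none."""
    current = parent_map.get(element)
    while current is not None and current.tag != anc_el_tag:
        current = parent_map.get(current)
    return current


root = ET.fromstring("<a><b><c/></b></a>")
parent_map = build_parent_map(root)
c = root.find("b/c")
ancestor = burrow_up(c, "a", parent_map)
```

The map costs one pass over the tree, after which each parent lookup is a dictionary access rather than a fresh search from the root.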

find_1st_matching_element(self, element, el_tag, return_element):

Burrow down into the XML tree, retrieve the first element with the 
matching tag

extract_numhits(self, element):

Search an element of a parsed XML string and find the
number of hits, if it exists.  Recursively searches,
if there are subelements.

class GbifSearchResultsError(Exception): pass

class GbifSearchResults():

GbifSearchResults is a class for holding a series of 
GbifObservationRecord records, and processing them e.g. into classified 
areas.

__init__(self, gbif_recs_xmltree=None):

Constructor; initializes a new object of this class.

print_records(self):

Print all records in tab-delimited format to screen.

print_records_to_file(self, fn):

Print the attributes of a record to a file with filename fn

latlongs_to_obj(self):

Takes the string from extract_latlongs, puts each line into a
GbifObservationRecord object.

Return a list of the objects

Functions devoted to accessing/downloading GBIF records
access_gbif(self, url, params):

Helper function to access various GBIF services

choose the URL ("url") from here:
http://data.gbif.org/ws/rest/occurrence

params are a dictionary of key/value pairs

"self._open" is adapted from Bio.Entrez._open, online here:
http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#self._open

Get the handle of results
(looks like e.g.: <addinfourl at 75575128 whose fp = <socket._fileobject 
object at 0x48117f0>> )

(open with results_handle.read() )

_get_hits(self, params):

Get the actual hits that are returned by a given search
(this allows parsing & gradual downloading of searches larger
than e.g. 1000 records)

It will return the LAST non-None instance (in a standard search result there
should be only one, anyway).

get_xml_hits(self, params):

Returns hits like _get_hits, but returns a parsed XML tree.

get_record(self, key):

Given the key, get a single record, return xmltree for it.

get_numhits(self, params):

Get the number of hits that will be returned by a given search
(this allows parsing & gradual downloading of searches larger
than e.g. 1000 records)

It will return the LAST non-None instance (in a standard search result there
should be only one, anyway).

xmlstring_to_xmltree(self, xmlstring):

Take the text string returned by GBIF and parse to an XML tree using 
ElementTree.
Requires the intermediate step of saving to a temporary file (required 
to make ElementTree.parse work, apparently)

tempfn = 'tempxml.xml'
fh = open(tempfn, 'w')
fh.write(xmlstring)
fh.close()
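
ElementTree.parse expects a filename or file object, which is probably why the temporary file seemed necessary. A sketch of a possible alternative: ElementTree.fromstring parses text directly, and the result can be wrapped if a full ElementTree object is wanted. The sample string is a made-up stand-in for GBIF output.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the text returned by a GBIF search
xmlstring = "<results><TaxonOccurrence/><TaxonOccurrence/></results>"

root = ET.fromstring(xmlstring)   # parses the text, returns the root Element
xmltree = ET.ElementTree(root)    # wrap it when an ElementTree is needed
```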

get_all_records_by_increment(self, params, inc):

Download all of the records in stages, store in list of elements.
Uses increments of e.g. 100 so as not to overload the server.
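
The staged download can be sketched as a simple paging loop. The fetch_page callable, its (start, count) arguments, and the fake server below are all hypothetical stand-ins so the example runs without touching GBIF.

```python
def get_all_records_by_increment(fetch_page, numhits, inc):
    """Collect numhits records by requesting at most inc records at a time."""
    records = []
    for start in range(0, numhits, inc):
        # The final page may be shorter than inc
        count = min(inc, numhits - start)
        records.extend(fetch_page(start, count))
    return records


# Stand-in for the remote service: 250 fake records, with each request logged
FAKE_SERVER = ["record%d" % i for i in range(250)]
requests_made = []


def fake_fetch_page(start, count):
    requests_made.append((start, count))
    return FAKE_SERVER[start:start + count]


all_records = get_all_records_by_increment(fake_fetch_page, len(FAKE_SERVER), 100)
```

In real use the total would come from something like get_numhits, and each page request would go through the rate-limited _open.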

extract_occurrences_from_gbif_xmltree_list(self, gbif_xmltree):

Extract all of the 'TaxonOccurrence' elements to a list, storing each in a 
GbifObservationRecord.

_paramsdict_to_string(self, params):

Converts the python dictionary of search parameters into a text
string for submission to GBIF
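
For comparison, the standard library can do this conversion. The sketch below assumes Python 3 (urllib.parse.urlencode; in Python 2 the same function lives in urllib), and the parameter names are illustrative rather than a checked list of GBIF options.

```python
from urllib.parse import urlencode

# Illustrative search parameters (names assumed, not verified against GBIF)
params = {"scientificname": "Puma concolor", "maxresults": 100}

# Sorting the items keeps the resulting query string reproducible
query = urlencode(sorted(params.items()))
```

urlencode also takes care of quoting spaces and other special characters, which a hand-rolled join of key=value pairs would have to handle separately.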

_open(self, cgi, params={}):

Function for accessing online databases.

Modified from:
http://www.biopython.org/DIST/docs/api/Bio.Entrez-module.html

Helper function to build the URL and open a handle to it (PRIVATE).

Open a handle to GBIF.  cgi is the URL for the cgi script to access.
params is a dictionary with the options to pass to it.  Does some
simple error checking, and will raise an IOError if it encounters one.

This function also enforces the "three-second rule" to avoid abusing
the GBIF servers (modified after NCBI requirement).
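
A minimal sketch of how such a delay can be enforced, assuming a small helper class. Throttle and its arguments are hypothetical, and the injectable clock and sleep functions exist only so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Ensure at least min_interval seconds pass between successive calls."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous request, if any

    def wait(self):
        """Block until a new request is allowed, then record its time."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


throttle = Throttle(min_interval=3.0)
throttle.wait()  # the first call never sleeps; later calls wait if needed
```

Calling throttle.wait() immediately before each request would be enough to space the requests out.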
============

> 
> Thanks again. Hope this helps,
> Brad

Very much, thanks!!
Nick

-- 
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
Department of Integrative Biology
University of California, Berkeley