[Biopython-dev] PhyloXML read/parse functions and handles
Peter
biopython at maubp.freeserve.co.uk
Sat May 9 09:06:15 EDT 2009
Hi Eric,
Are you happy to have feedback on your PhyloXML code in public? In
this case I wanted to make a fairly general observation about parsing
files using handles, so I have cc'd the dev list.
I just had a look at the stubs in Bio/PhyloXML/__init__.py and
Bio/PhyloXML/Parser.py on your GitHub branch:
http://github.com/etal/biopython/tree/phyloxml
The convention we are following in Biopython for parsing functions is
as follows:
read(handle, ...) - returns a single object (e.g. a tree in your case)
parse(handle, ...) - returns an iterator (e.g. returning multiple trees)
[This naming convention is arbitrary, but we should try to stick to it
in all new parsers for consistency.]
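
To make the convention concrete, here is a minimal sketch. It is not the
actual Biopython code: the record format (one record per non-blank line)
is a toy assumption for illustration, and raising ValueError when the
handle does not hold exactly one record is my choice for this example.

```python
from io import StringIO


def parse(handle):
    """Yield records one at a time (toy format: one per non-blank line)."""
    for line in handle:
        line = line.strip()
        if line:
            yield line


def read(handle):
    """Return the single record in the handle, or raise ValueError."""
    iterator = parse(handle)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No records found in handle")
    try:
        next(iterator)
    except StopIteration:
        # Exactly one record, which is what read() promises.
        return first
    raise ValueError("More than one record found in handle")


records = list(parse(StringIO("tree1\n\ntree2\n")))
single = read(StringIO("tree1\n"))
```

Building read() on top of parse() keeps the file format logic in one
place, so the two functions cannot drift apart.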
In Bio/PhyloXML/Parser.py you have a parse() function which,
according to its comment, appears to return a single tree. If so, this
should be a read() function instead of a parse() function.
You seem to have a read() stub function in Bio/PhyloXML/__init__.py
which returns a single tree (good), but takes a (zip) filename (not a
handle - bad). Taking just a filename prevents using a whole range of
handle objects as input - e.g. StringIO handles, URL handles, piped
output from a command line tool etc. This flexibility is why we focus
on dealing with handles for parsers.
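
As a small illustration of that flexibility, the sketch below passes an
in-memory handle and an ordinary file handle to the same function. The
count_trees() function is a hypothetical stand-in for a real parser; it
just counts semicolons, and the Newick-style text is made up.

```python
import os
import tempfile
from io import StringIO


def count_trees(handle):
    """Hypothetical stand-in for a parser: count semicolon terminators."""
    return sum(line.count(";") for line in handle)


text = "(A,B);\n(C,D);\n"

# An in-memory handle, handy for unit tests or data fetched elsewhere.
from_memory = count_trees(StringIO(text))

# An ordinary file handle, opened by the caller rather than the parser.
with tempfile.NamedTemporaryFile("w", suffix=".nwk", delete=False) as tmp:
    tmp.write(text)
    path = tmp.name
with open(path) as handle:
    from_disk = count_trees(handle)
os.remove(path)
```

The function never needs to know where the text came from, only that it
can iterate over the lines of the handle.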
On a related point, you should leave unzipping the file to the user -
this is not specific to dealing with XML tree files. Plus, in
addition to zip files (i.e. pkzip/winzip format), there are other
compressed file formats to consider, such as tarballs. They too can be
opened and decompressed on the fly as a handle (e.g. see the gzip Python
library). By taking a handle as the input your parser can then be
used with any of these input sources.
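
For example, the standard library gzip module can hand back a text
handle that decompresses as it is read. In this sketch first_line() is a
hypothetical stand-in for a parser, and the compressed data is built in
memory only to keep the example self-contained.

```python
import gzip
import io


def first_line(handle):
    """Hypothetical stand-in for a parser: return the first line."""
    return handle.readline().rstrip("\n")


# Build some gzip-compressed bytes in memory (normally a .gz file on disk).
compressed = gzip.compress(b"(A,B);\n(C,D);\n")

# The caller opens the compressed data; "rt" gives a text-mode handle.
with gzip.open(io.BytesIO(compressed), "rt") as handle:
    line = first_line(handle)
```

With a real file the caller would pass a filename to gzip.open() instead
of the BytesIO object; the parser itself stays unchanged.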
Peter
P.S. Finally, a more general note about a possible "Bio.TreeIO"
module. For simple Newick trees, a single file can contain one or more
trees (e.g. from bootstrapping). A tree may be split over multiple
lines (or be one long line), but multiple trees can still be separated
because each should end with a semicolon terminator. For Nexus files,
I'm not sure offhand if there can be more than one tree. If you are
going to use the Tree objects from Bio.Nexus, then we could provide a
"Bio.TreeIO" module with read/parse/write methods coping with
"newick", "nexus", "phyloxml" formats, all using the same tree
objects.
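
A rough sketch of how such a module might dispatch on the format name is
below. Everything here is hypothetical: the function names, the format
table, and the use of plain strings in place of real tree objects. Only
"newick" is filled in, and the semicolon splitting is naive (it ignores
semicolons inside quoted labels or comments).

```python
from io import StringIO


def _parse_newick(handle):
    """Yield each tree string, joining lines up to a semicolon terminator."""
    buffer = ""
    for line in handle:
        buffer += line.strip()
        while ";" in buffer:
            tree, buffer = buffer.split(";", 1)
            yield tree + ";"
    if buffer:
        raise ValueError("Trailing text without a semicolon terminator")


# Hypothetical lookup table; "nexus" and "phyloxml" would be added here.
_PARSERS = {"newick": _parse_newick}


def parse(handle, format):
    """Return an iterator over the trees in the handle for a named format."""
    try:
        parser = _PARSERS[format]
    except KeyError:
        raise ValueError("Unsupported format: %r" % format)
    return parser(handle)


trees = list(parse(StringIO("(A,\nB);(C,D);\n(E,F);\n"), "newick"))
```

Here the first tree spans two lines and the second shares a line with
the end of the first, yet all three come out as separate items.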