[Biopython-dev] PhyloXML read/parse functions and handles

Sun May 10 09:22:21 UTC 2009

On Sun, May 10, 2009 at 6:22 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> The function currently allows either filenames or file handles as the source
> because ElementTree.iterparse() also accepts either object as a source. The
> read() function could "assert not isinstance(infile, str)", I guess...

Interesting - ReportLab also allows filenames or handles.  If this truly is a
widespread or growing trend in Python libraries, maybe we should do this
as well.
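
For illustration, a minimal sketch of a function that accepts either a
filename or a handle. The names as_handle and count_lines are hypothetical,
not existing Biopython functions, and the string check is only one possible
way to tell the two apart:

```python
import io
from contextlib import contextmanager


@contextmanager
def as_handle(source, mode="r"):
    """Yield a file handle for either a filename or an open handle (hypothetical helper)."""
    if isinstance(source, str):
        # A filename: open it here and close it when the block exits.
        with open(source, mode) as handle:
            yield handle
    else:
        # Assume a file-like object; the caller remains responsible for closing it.
        yield source


def count_lines(source):
    """Toy consumer: works the same for a filename or a handle."""
    with as_handle(source) as handle:
        return sum(1 for _ in handle)


# Usage with an in-memory handle standing in for an open file.
n_lines = count_lines(io.StringIO("first\nsecond\n"))
```

The cost of this convenience is that every public function has to make the
same check, which is the consistency concern with the rest of Biopython.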

> The existing Java implementation in Forester/ATV has even more magic,
> automatically performing Zip extraction if the given filename ends with
> '.zip'. Since this looks like it will be a pretty common use case, at least
> for big files, I thought it would be nice to also offer a wrapper function
> that takes a filename and does the Right Thing -- that's what
> __init__.read() does currently. Is there a precedent for this in Biopython?

Note that Bio.Nexus does this already, making it a bit inconsistent with the
rest of Biopython.  I guess no one noticed or commented back when it was
added.

> The name should probably be something different; in the pdbtidy branch I
> used load(), to match the Pickle module, since the wrapper function does
> more than just parse or read a file.
>
> So how about:
>
> from Bio import PhyloXML
> handle = open('somefile', 'r') # file-like object from any source
> tree = PhyloXML.read(handle)
>
> Equivalent to:
>
> from Bio import PhyloXML
> tree = PhyloXML.load('somefile') # DTRT for xml, zip, gz, ...?
>
> Or, to be explicit, offer a read_zip or load_zip function.

I prefer the more explicit read_zip idea; you would also want an optional
argument for the filename within the zip file.  However, I'm not yet
convinced we need this function.
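
To make the idea concrete, here is a rough sketch using only the standard
library. The name open_zip_member and its arguments are assumptions for
illustration, not a proposed final API; it refuses to guess when the archive
holds more than one file and no member name is given:

```python
import io
import zipfile
from contextlib import contextmanager


@contextmanager
def open_zip_member(zip_source, member=None, encoding="utf-8"):
    """Yield a text handle for one file inside a zip archive (hypothetical helper).

    zip_source: a filename or a seekable binary file-like object.
    member: name of the file within the archive; if None, the archive
            must contain exactly one file.
    """
    with zipfile.ZipFile(zip_source) as archive:
        if member is None:
            # Ignore directory entries, which end with a slash.
            names = [n for n in archive.namelist() if not n.endswith("/")]
            if len(names) != 1:
                raise ValueError(
                    "archive holds %d files; pass member= explicitly" % len(names)
                )
            member = names[0]
        with archive.open(member) as raw:
            # Wrap the binary stream so parsers receive text.
            yield io.TextIOWrapper(raw, encoding=encoding)
```

The resulting handle could then be passed to a handle-based read() function,
so the zip logic stays separate from the parser itself.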

> I'd leave well enough alone, but the incantation to extract a character
> stream from a single zipped file is kind of unintuitive, and one of the
> three example files on phyloxml.org is already zipped. (I should really
> ask Christian Zmasek about this to see if that's a real convention or
> not.)

Do you want to find out if this really is a phyloxml.org convention first?

If this is their convention, it surprises me they didn't go for .gz files,
which in my experience are more widely used in Bioinformatics (e.g.
at the NCBI and PDB).  These are supported cross platform and hold
one single file (often a tarred file containing multiple files).  A zip file
can hold multiple files, which means you have to make extra
assumptions (e.g. you are using the first file in your code).
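
The difference shows up directly in the standard library. A small sketch
(read_gz_text and list_zip_members are hypothetical helper names): a gzip
file gives you one stream with no further choices, while a zip archive first
has to be asked which members it contains:

```python
import gzip
import io
import zipfile


def read_gz_text(source, encoding="utf-8"):
    """Return the full text of a gzip file (filename or binary handle)."""
    # gzip holds exactly one compressed stream, so there is nothing to choose.
    with gzip.open(source, "rt", encoding=encoding) as handle:
        return handle.read()


def list_zip_members(source):
    """Return the member names of a zip archive (filename or binary handle)."""
    # A zip archive may hold many files; the caller must pick one.
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()
```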

>> P.S. Finally, a more general note about a possible "Bio.TreeIO"
>> module. For simple Newick trees, a single file can contain one or more
>> trees (e.g. from bootstrapping).  A tree can be split over multiple
>> lines (but may be one long line), but multiple trees can be split up
>> because they should all have a semicolon terminator.  For Nexus files,
>> I'm not sure off hand if there can be more than one tree.  If you are
>> going to use the Tree objects from Bio.Nexus, then we could provide a
>> "Bio.TreeIO" module with read/parse/write methods coping with
>> "newick", "nexus", "phyloxml" formats, all using the same tree
>> objects.
>>
>
> OK, I'll give it a try. Brad recommended that I just get a simple PhyloXML
> parser working first before attempting integration, but if some of Bio.Nexus
> can be reused in that process, great.

Brad is right - getting a simple PhyloXML parser working is the first step.
It would be sensible to look at the Bio.Nexus tree structure though.
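
On the Newick point quoted above: since each tree ends with a semicolon, a
file holding several trees can be split regardless of line breaks. A naive
sketch, assuming no semicolons occur inside quoted labels or [comments];
iter_newick_strings is a hypothetical name, not an existing Biopython
function:

```python
import io


def iter_newick_strings(handle):
    """Yield one Newick string per tree from a handle (naive semicolon split)."""
    buffer = ""
    for line in handle:
        # Join continuation lines; a tree may span several lines.
        buffer += line.strip()
        while ";" in buffer:
            tree, buffer = buffer.split(";", 1)
            tree = tree.strip()
            if tree:
                # Restore the terminator that split() removed.
                yield tree + ";"


# Usage: three trees, the second split over two lines.
example = io.StringIO("(A,B);\n(C,\n(D,E));(F,G);\n")
trees = list(iter_newick_strings(example))
```

Any text left in the buffer without a terminating semicolon is silently
dropped here; a real parser would probably want to report that.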

> I'm about to go dark from the end of this week until 3/31 (getting
> married, yaknow), but I'll fix all this code when I get back and have
> access to git again.

Congratulations - it looks like you've got a proper break scheduled as well :)

Peter