[Biopython-dev] Code review request for phyloxml branch

Sat Jan 9 10:15:56 EST 2010

--- On Fri, 1/8/10, Eric Talevich <eric.talevich at gmail.com> wrote:
> Yep, BioPerl has a TreeIO module, too. BioRuby and BioJava
> do something completely different.
> 
> I had the impression that pairing modules Foo & FooIO
> was an emerging convention for organizing very general
> data types being fed by a variety of file formats, while
> a single module Foo indicated support
> for a particular program or source, like Entrez.

I think a workable convention, which is already followed by many Biopython modules, is the following:

1) Bio.SomeStuff is a module containing everything related to SomeStuff, where SomeStuff is some broadly-defined field within bioinformatics (Cluster for clustering algorithms, Phylo for phylogenetics, PopGen for population genetics, Entrez for NCBI Entrez related stuff, etc.).

2) Parsing SomeStuff files, which can be in a variety of formats, is done by a read() function (to parse a single record), and/or a parse() function (to parse multiple records). The implementation details of these functions are hidden in a submodule of Bio.SomeStuff. Typically, the user won't need to interact with the submodule directly.

3) The read() / parse() functions return Bio.SomeStuff.Record objects, where Bio.SomeStuff.Record is a class that represents the primary data structure of SomeStuff information.

This general framework may not be suitable in all aspects for all Biopython modules, and can be modified as needed. For example, I can imagine that the most important data structure in Bio.Phylo is a Tree object rather than a Record object.
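
As a rough illustration of points 2) and 3), here is a minimal sketch of the read() / parse() pairing. The Record class, the toy "name: v1 v2 ..." line format, and the choice to raise ValueError when read() does not find exactly one record are all assumptions made for the example, not existing Biopython code:

```python
import io


class Record:
    """Hypothetical stand-in for a Bio.SomeStuff.Record class."""

    def __init__(self, name, values):
        self.name = name
        self.values = values


def parse(handle):
    """Yield one Record per non-blank line of the toy 'name: v1 v2' format."""
    for line in handle:
        line = line.strip()
        if not line:
            continue
        name, _, rest = line.partition(":")
        yield Record(name.strip(), rest.split())


def read(handle):
    """Return the single Record in handle; raise ValueError otherwise."""
    records = parse(handle)
    try:
        first = next(records)
    except StopIteration:
        raise ValueError("No records found in handle") from None
    try:
        next(records)
    except StopIteration:
        return first
    raise ValueError("More than one record found in handle")


# Usage: read() for a single record, parse() for several.
single = read(io.StringIO("alpha: 1 2 3\n"))
many = list(parse(io.StringIO("alpha: 1 2 3\nbeta: 4 5\n")))
```

Building read() on top of parse() keeps the format logic in one place; a module whose primary data structure is a Tree would yield Tree objects instead.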

> But I think it would
> be even cleaner if each Foo simply had a Foo.IO (or foo.io)
> sub-module organizing the I/O for multiple file formats where
> applicable.

I agree.

> The TreeIO.* namespace is not crowded -- just read, write,
> parse, convert. If that directory is moved under Bio.Tree and
> renamed to IO or io, then Bio.Tree would still seem reasonably
> intuitive if __init__.py contained:
> 
> from io import *
> from utils import *
> 
> Then "from Bio import Tree" would be enough for most uses.

Rather than importing *, can we import only those functions that a user would actually use? We should avoid importing stuff that is essentially used only locally in each sub-module.
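
To make this concrete, here is a small sketch that builds a throwaway package on disk and checks what a user actually sees. The package name somestuff_demo, the sub-module name _io, and the functions inside it are hypothetical; the point is only that __init__.py names what it re-exports and lists it in __all__:

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package in a temporary directory (all names hypothetical).
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "somestuff_demo")
os.mkdir(pkg_dir)

# The sub-module holds one user-facing function and one local helper.
with open(os.path.join(pkg_dir, "_io.py"), "w") as f:
    f.write(
        "def read(text):\n"
        "    return ' '.join(tokenize(text))\n"
        "\n"
        "def tokenize(text):\n"
        "    # Helper used only inside this sub-module.\n"
        "    return text.split()\n"
    )

# __init__.py re-exports only what a user is expected to call.
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write(
        "from ._io import read\n"
        "\n"
        "__all__ = ['read']\n"
    )

sys.path.insert(0, root)
importlib.invalidate_caches()
pkg = importlib.import_module("somestuff_demo")

# Simulate 'from somestuff_demo import *' in a fresh namespace.
namespace = {}
exec("from somestuff_demo import *", namespace)
public_names = sorted(n for n in namespace if not n.startswith("__"))
```

With this layout the helper tokenize() stays out of the package namespace, and a star import by the user picks up only read.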

Another option is to have all functions that are intended to be used by the user in Bio.Phylo, and have those functions access (internally) any sub-module as needed. For example, a user would not notice that Bio.Phylo.read actually uses code from Bio.Phylo.io; the latter module would not be accessed directly by the user.
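
One way this second option could look is sketched below in a single file. The format names, the two private parser functions, and the lookup table are invented for illustration; in a real package the private parts would sit in the hidden sub-module and only parse() would be exposed at the top level:

```python
import io


def _parse_lines(handle):
    """Hypothetical parser: one record per non-blank line."""
    for line in handle:
        if line.strip():
            yield line.strip()


def _parse_csv(handle):
    """Hypothetical parser: records separated by commas."""
    for field in handle.read().split(","):
        if field.strip():
            yield field.strip()


# Private lookup table mapping a format name to its parser.
_PARSERS = {"lines": _parse_lines, "csv": _parse_csv}


def parse(handle, fmt):
    """Public entry point: pick the parser for fmt and delegate to it."""
    try:
        parser = _PARSERS[fmt]
    except KeyError:
        raise ValueError("Unknown format: %r" % fmt) from None
    return parser(handle)


# The caller names a format and never touches the private parsers.
names = list(parse(io.StringIO("a,b , c"), "csv"))
```

Adding a new file format then means adding one entry to the table, with no change to the functions the user calls.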

> > perhaps something different like Bio.Phylo instead?
> 
> Sure, that sounds promising.

I agree that Bio.Phylo is a good name. Note also that there is already a Tree class in Bio.Cluster (it represents hierarchical clustering trees). Having a Bio.Phylo.Tree class for phylogenetic trees and a Bio.Cluster.Tree class for hierarchical clustering trees is not confusing. On the other hand, having a Bio.Tree.Tree class for phylogenetic trees alongside a Bio.Cluster.Tree class for hierarchical clustering trees could be confusing.

--Michiel