[Biopython-dev] Namespace for online resources?

Sat Feb 2 17:29:57 EST 2013

On Fri, Feb 1, 2013 at 9:14 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Fri, Feb 1, 2013 at 1:54 PM, Michiel de Hoon <mjldehoon at yahoo.com>
> wrote:
> > Hi Lenna,
> >
> >> Regarding point (2), is your primary concern namespace clutter or
> >> importing efficiency?
> >
> > Regarding point (2), my primary concern is that a Bio.WWW module would
> > group together modules that don't have much in common with each other. I
> > agree with your point that the category of internet access is more
> > fundamental than the category of parsers. But still, which modules should
> > then go into a Bio.WWW module? Any module whose sole purpose is to use the
> > internet (that would exclude Bio.Entrez)? Any module whose main purpose is
> > to use the internet? This would be unclear; for example, Bio.Entrez may or
> > may not fall in that category, depending on how you use the module. Any
> > module whose functionality includes internet access? Then if one day we add
> > access to the JASPAR database over the internet to Bio.Motif, it would have
> > to move to Bio.WWW.
> >
> > Currently most modules are organized by theme (Bio.Seq, Bio.Motif,
> > Bio.Cluster, Bio.Phylo, Bio.Entrez, etc.). For each theme, we have one
> > module, one chapter in the documentation, one set of unit tests, one set
> > of doctests, which I think is a huge advantage both in terms of clarity
> > and in terms of user experience.
>
> Also with the theme approach, most (if not all) the themes are likely to
> have some online resources (databases or remote APIs). On those
> grounds it makes sense to keep online motif functionality (like weblogo)
> under Bio.Motif, and so on.
>

I agree.
From an engineering perspective, it's usually best to organize code around
data types. (To be clear: think classes and structures, not ints and
strings.) The SeqIO, AlignIO, SearchIO, Phylo, Motif, PDB, etc. modules
each have a core data type that serves as the "theme" for the sub-package.
Within the sub-package we can have modules for different file formats, data
transformations/manipulations, web servers, and command-line program
wrappers, and keep all the interdependencies within the same small region
of the code base. Since most users will not read the documentation in its
entirety (if at all), this also makes it easier to look up how to do things
with the data type in question.
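
To make that concrete, here is a minimal sketch of a theme-centred layout,
collapsed into one runnable file. All names in it (Motif, parse_plain, fetch,
fake_service, and the file names in the comment) are hypothetical and chosen
for illustration; this is not the actual Biopython code or tree.

```python
# Hypothetical layout, for illustration only (not the real Biopython tree):
#
#   Bio/Motif/__init__.py   core data type (Motif)
#   Bio/Motif/formats.py    file-format parsers that return Motif
#   Bio/Motif/online.py     web-service access that returns Motif
#
# The same idea collapsed into one runnable file:
import io
from dataclasses import dataclass


@dataclass
class Motif:
    """Core data type: the 'theme' everything else depends on."""
    name: str
    instances: list

    def consensus(self):
        """Most common letter per column (ties broken alphabetically)."""
        columns = zip(*self.instances)
        return "".join(max(sorted(set(col)), key=col.count) for col in columns)


def parse_plain(handle, name="unnamed"):
    """File-format code: one motif instance per non-blank line."""
    instances = [line.strip().upper() for line in handle if line.strip()]
    return Motif(name=name, instances=instances)


def fetch(name, open_handle):
    """Online code: 'open_handle' is any callable returning a text handle,
    for example a thin wrapper around urllib.request.urlopen."""
    return parse_plain(open_handle(name), name=name)


def fake_service(name):
    """Stand-in for a remote service so the sketch runs offline."""
    return io.StringIO("ACGT\nACGA\nTCGA\n")


motif = fetch("demo", fake_service)
print(motif.consensus())
```

The parser and the online fetcher both depend only on the Motif type, so the
interdependencies stay inside the one sub-package.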

The core data type for a WWW module would be a network handle, I suppose --
but that's already part of the Python standard library.
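
As a rough illustration of that point, the sketch below uses only the
standard library. The helper names (read_fasta_ids, open_text_url) and the
URL in the final comment are placeholders of mine, not existing Biopython
functions or a real resource.

```python
import io
import urllib.request


def read_fasta_ids(handle):
    """Return record identifiers from any text handle in FASTA format."""
    return [
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and line[1:].strip()
    ]


def open_text_url(url, encoding="utf-8"):
    """Wrap the standard library's binary network handle as a text handle."""
    return io.TextIOWrapper(urllib.request.urlopen(url), encoding=encoding)


# A local handle and a network handle look the same to the parser.
local = io.StringIO(">seq1 demo\nACGT\n>seq2\nGGCC\n")
ids = read_fasta_ids(local)
print(ids)

# With network access, the same parser would take a remote handle
# (placeholder URL, not a real resource):
#   read_fasta_ids(open_text_url("https://example.org/some.fasta"))
```

Because the parser only needs something it can iterate over line by line, a
separate package built around "the handle" would have little of its own to add.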

I've suggested before that we can justify the current placement of
sequence-related modules at the top level, rather than under a new "Seq"
sub-package, by considering sequences to be the default/implicit data type.
As we've covered, many online resources can serve up several different data
types, although sequences are probably the most common. In terms of
namespace clutter, perhaps I've gotten too used to R, but I don't think
we've reached the point where the number of modules and functions visible
from the top level harms the user experience.

> In the specific case of Kevin's TAIR code for fetching Arabidopsis sequences,
> Bio.TAIR (lower case?) is consistent with current usage. Somewhere under
> Bio.Seq* also seems sensible to me, as I wrote at the start of this thread.
>

Bio.TAIR or Bio.Seq.TAIR or perhaps Bio.Seq.WWW.TAIR seem sensible to me,
too. No preference on casing.

-Eric