[Biopython-dev] Namespace for online resources?

Wed Jan 30 12:20:39 EST 2013

Hi everyone,

Peter, thanks for the links to the archives. I'm starting to get a
grip on why Bio.WWW was deprecated in the first place.

Michiel, thanks for the explanation. My responses are below.

My reply is a bit long, so in the interest of brevity, I'll say first
that I'm in favor of putting TAIR in Bio.TAIR now, for practical
reasons and consistency with similar modules. But I do still have some
slight objections to this approach.

> Bio.WWW was one of those modules that seem a good idea at first, but then failed to gain general acceptance. There are three problems with Bio.WWW:
>
> 1) From the module name, it's not clear what you would find in it. For example, if you want to access the Entrez database, would you first look in Bio.Entrez or in Bio.WWW? Similarly for TAIR: Would you look for it in Bio.TAIR, or in Bio.WWW?

This seems to be a naming issue, but it does not invalidate the idea
of having one central place for online access. I'll continue to refer
to this module as Bio.WWW here, but there may be other, more suitable
names, such as Bio.remotedb, Bio.remote.db, or Bio.www.db (or something
else), which would make the module a more intuitive place to look,
right?

> 2) The modules in Bio.WWW don't have much to do with each other, except that they access the internet. But any given user probably is mainly interested in Entrez, or ExPASy, or some other database, not in all of them at the same time.

We could add a note to the documentation pointing this out, right? If
we are worried about loading unnecessary modules, we can keep the
__init__.py in Bio.WWW empty, and have Entrez, ExPASy, and the others
inside Bio.WWW.
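
A minimal sketch of the empty __init__.py point, using a throwaway
package built in a temp directory. The names demo_www, entrez, and
expasy are made-up stand-ins for illustration, not the real Biopython
modules:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk: demo_www/ with an empty __init__.py
# and two stand-in submodules. All names here are hypothetical.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_www"
pkg.mkdir()
(pkg / "__init__.py").write_text("")  # deliberately empty
(pkg / "entrez.py").write_text("NAME = 'entrez stand-in'\n")
(pkg / "expasy.py").write_text("NAME = 'expasy stand-in'\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()

import demo_www  # loads only the (empty) package itself

loaded_after_package_import = sorted(
    name for name in sys.modules if name.startswith("demo_www")
)

from demo_www import entrez  # loads exactly one submodule on request

loaded_after_submodule_import = sorted(
    name for name in sys.modules if name.startswith("demo_www")
)

print(loaded_after_package_import)
print(loaded_after_submodule_import)
```

With an empty __init__.py, importing the package alone pulls in none of
the submodules, and asking for one submodule does not drag in its
siblings.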

> 3) The flip side of this is that a user accessing e.g. ExPASy would have to import both Bio.WWW and Bio.ExPASy to be able to use ExPASy. Doctests get more complicated also, as they would span more than one module. Here is an example from Bio.Entrez that accesses the database, and then parses the results:
> >>> from Bio import Entrez
> >>> Entrez.email = "Your.Name.Here@example.org"
> >>> handle = Entrez.einfo() # or esearch, efetch, ...
> >>> record = Entrez.read(handle)
> >>> handle.close()

Since ExPASy's formats may be specific to them, I was thinking their
parsers should also go in Bio.WWW (in this case, Bio.WWW.ExPASy).

Note that at the moment we also have cases where the database entry
retriever and parser lie in different submodules of the code (e.g.
fetching FASTA with Bio.Entrez and parsing it with Bio.SeqIO). This
is OK in my opinion, however, as FASTA is a widely used format not
exclusive to Entrez. But for exclusive formats like ExPASy's or
Entrez's, it makes sense to keep the parsers in the same module as
their database entry retriever.
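
To illustrate what keeping the two together could look like, here is
an offline sketch. DemoDB, its fetch() and read() functions, and the
two-letter record format are all invented for this example; nothing
here is an existing Biopython API, and a real retriever would open a
network handle instead of returning canned text:

```python
from io import StringIO


class DemoDB:
    """Hypothetical stand-in for a database-specific module in which the
    retriever and the parser for the database's own format live together."""

    @staticmethod
    def fetch(entry_id):
        # Canned text in a made-up format keeps the sketch offline.
        return StringIO(f"ID   {entry_id}\nDE   made-up description\n//\n")

    @staticmethod
    def read(handle):
        # Parse the two-letter line codes of the made-up format into a dict.
        record = {}
        for line in handle:
            if line.startswith("//"):  # end-of-record marker
                break
            key, _, value = line.rstrip("\n").partition("   ")
            record[key] = value
        return record


# Same fetch / read / close pattern as the Bio.Entrez doctest quoted
# above, but everything comes from a single place.
handle = DemoDB.fetch("P12345")
record = DemoDB.read(handle)
handle.close()
print(record)
```

The user would only need to know one module to both retrieve and parse
an entry in that database's own format.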

> The ultimate question is whether we organize the code in Biopython by their functionality from a user perspective, or by the kind of things they do? Almost all of Biopython is organized according to the former. For example, we don't have a Bio.Parsers module for all the parsers; similarly, we don't have Bio.WWW for internet access.

Hmm... those two points are not necessarily mutually exclusive, right?
I think having a centralized module for online access still makes for
a functional grouping based on a user's perspective.

In the parsers' case, it makes sense to organize them the way we do now
as there are so many parsers. But for online access, I think it's
still manageable to put them in one directory. Just to throw the idea
around, we may also have subdirectories for different kinds of online
access (e.g. Bio.www.db for online database access, Bio.www.app for
online tools access like NCBI BLAST or HMMER).

This is not something urgent, but maybe worth thinking about and discussing :).

Cheers,
Bow