[Biopython-dev] Bio.WWW; database access

Tue Nov 13 04:10:41 EST 2007

Michiel De Hoon wrote:
> Hi everybody,
> 
> Recently Eric Gibert wrote to us about Bio.GenBank.NCBIDictionary failing
> with Biopython 1.44. We already have a fix for this error in CVS, but when I
> was looking into this bug in more detail I started wondering about the way
> database access is organized in Biopython.

See also bug 2393 for the background to this discussion.
http://bugzilla.open-bio.org/show_bug.cgi?id=2393

> Currently, code to access NCBI Entrez exists in three places:
> 1) Bio.WWW.NCBI
> 2) Bio.GenBank (in NCBIDictionary)
> 3) Bio.EUtils
> 
> Bio.WWW contains three more submodules for database access:
> 1) Bio.WWW.ExPASy, to access Swissprot, Prodoc, Prosite
> 2) Bio.WWW.InterPro, to access InterPro
> 3) Bio.WWW.SCOP, to access the SCOP database
> 
> The parsers for these modules are in a different location:
> 1a) Bio.SwissProt
> 1b) Bio.Prosite
> 1c) Bio.Prosite.Prodoc
> 2) Bio.InterPro
> 3) Bio.SCOP
> 
> To me, it seems odd that the code for database access and the code to parse
> files downloaded from the database are in different locations. For example,
> when I was working on Bio.GenBank, it did not occur to me that such code
> might already exist in Bio.WWW.

My initial reaction was different: having noticed there was a WWW 
module (it caught my eye as the last directory), I looked there first 
for online resources.

> Now, Bio.WWW.SCOP is a very small module (64 lines total), and
> Bio.WWW.InterPro seems to be out of date. With Bio.WWW.NCBI containing
> functionality that also exists elsewhere in Biopython, having a separate
> Bio.WWW module doesn't seem to be optimal in terms of code organization. I'd
> prefer to have the code for database access together with the respective code
> for parsing. 
> 
> Any opinions?

I do agree that having things in two places is not optimal for new 
users, and on balance having code to access the online resource 
associated with a file format in the same place as the parsers seems 
reasonable.

However, what about the fact that some online resources (e.g. GenBank) 
will return several sorts of data (e.g. journal references and 
sequences) and/or return data in a range of file formats (e.g. GenBank, 
Fasta, XML, ...)?  In this situation, having the online interface 
separate from the format parsers makes some sense.  I am perhaps 
playing devil's advocate here.
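
To make the point concrete, here is a rough sketch of what I mean by 
keeping the online interface separate from the format parsers.  Every 
name in it (fetch, parse_fasta, parse_summary, PARSERS, the canned 
records and the "summary" format) is made up for illustration and is 
not existing Biopython code; the "database" is just a dictionary so the 
example runs without a network connection.

```python
from io import StringIO

# Hypothetical canned responses standing in for an online database.
# A real access layer would make an HTTP request here instead.
_FAKE_RECORDS = {
    ("X00001", "fasta"): ">X00001 example\nACGT\nTTGA\n",
    ("X00001", "summary"): "id=X00001\nlength=8\n",
}


def fetch(identifier, fmt):
    """Access layer: return a file-like handle, with no parsing logic."""
    return StringIO(_FAKE_RECORDS[(identifier, fmt)])


def parse_fasta(handle):
    """Format layer: read one FASTA record from any handle."""
    title = handle.readline().rstrip()[1:]  # drop the leading ">"
    sequence = "".join(line.strip() for line in handle)
    return title, sequence


def parse_summary(handle):
    """Format layer: read key=value lines into a dictionary."""
    return dict(line.strip().split("=", 1) for line in handle if line.strip())


# One access function can feed several independent parsers.
PARSERS = {"fasta": parse_fasta, "summary": parse_summary}


def fetch_and_parse(identifier, fmt):
    """Convenience wrapper tying the two layers together."""
    return PARSERS[fmt](fetch(identifier, fmt))
```

Because the parsers only see a handle, the same parse_fasta would work 
on a local file, which is the argument for not tying each parser to a 
single online source.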

Peter