[Biopython-dev] Bio.WWW; database access
Peter
biopython-dev at maubp.freeserve.co.uk
Tue Nov 13 04:10:41 EST 2007
Michiel De Hoon wrote:
> Hi everybody,
>
> Recently Eric Gibert wrote to us about Bio.GenBank.NCBIDictionary failing
> with Biopython 1.44. We already have a fix for this error in CVS, but when I
> was looking into this bug in more detail I started wondering about the way
> database access is organized in Biopython.
See also bug 2393 for the background to this discussion.
http://bugzilla.open-bio.org/show_bug.cgi?id=2393
> Currently, code to access NCBI Entrez exists in three places:
> 1) Bio.WWW.NCBI
> 2) Bio.GenBank (in NCBIDictionary)
> 3) Bio.EUtils
>
> Bio.WWW contains three more submodules for database access:
> 1) Bio.WWW.ExPASy, to access Swissprot, Prodoc, Prosite
> 2) Bio.WWW.InterPro, to access InterPro
> 3) Bio.WWW.SCOP, to access the SCOP database
>
> The parsers for these modules are in a different location:
> 1a) Bio.SwissProt
> 1b) Bio.Prosite
> 1c) Bio.Prosite.Prodoc
> 2) Bio.InterPro
> 3) Bio.SCOP
>
> To me, it seems odd that the code for database access and the code to parse
> files downloaded from the database are in different locations. For example,
> when I was working on Bio.GenBank, it did not occur to me that such code
> might already exist in Bio.WWW.
My initial reaction was different - having noticed there was a WWW
module (it caught my eye as the last directory), I looked there
first for online resources.
> Now, Bio.WWW.SCOP is a very small module (64 lines total), and
> Bio.WWW.InterPro seems to be out of date. With Bio.WWW.NCBI containing
> functionality that also exists elsewhere in Biopython, having a separate
> Bio.WWW module doesn't seem to be optimal in terms of code organization. I'd
> prefer to have the code for database access together with the respective code
> for parsing.
>
> Any opinions?
I do agree that having things in two places is not optimal for new
users, and on balance having code to access the online resource
associated with a file format in the same place as the parsers seems
reasonable.
However, what about the fact that some online resources (e.g. GenBank)
will return several sorts of data (e.g. journal references and
sequences) and/or in a range of file formats (e.g. GenBank, Fasta, XML,
...)? In this situation, having the online interface separate from the
format parsers makes some sense. I am perhaps being a devil's advocate
here.
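To make the point concrete, here is a minimal sketch of that situation. All
names are hypothetical (none of them are Biopython functions), and the
"online resource" is an in-memory stub rather than a real database. It only
illustrates one access function serving several formats, each with its own
parser:

```python
# Hypothetical sketch: one access layer, several format parsers.
# None of these names exist in Biopython; the "database" is a stub.

_FAKE_DB = {
    ("X00001", "fasta"): ">X00001 example\nACGT\nTTGA\n",
    ("X00001", "summary"): "id=X00001\nlength=8\n",
}


def fetch(identifier, fmt):
    """Stand-in for an online lookup; returns raw text in the given format."""
    return _FAKE_DB[(identifier, fmt)]


def parse_fasta(text):
    """Parse a single FASTA record into a title and a joined sequence."""
    lines = text.strip().splitlines()
    return {"title": lines[0][1:], "sequence": "".join(lines[1:])}


def parse_summary(text):
    """Parse simple key=value lines into a dictionary of strings."""
    return dict(line.split("=", 1) for line in text.strip().splitlines())


# The access code does not need to know about any one format;
# it just hands the raw text to whichever parser matches.
PARSERS = {"fasta": parse_fasta, "summary": parse_summary}


def fetch_and_parse(identifier, fmt):
    return PARSERS[fmt](fetch(identifier, fmt))
```

With this layout the access code lives in one place even though the parsers
could each live in their own module.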
Peter