[Biopython-dev] Questions about code contributions

Rob Knight rob at spot.colorado.edu
Fri Feb 21 13:06:06 EST 2003

Hi Jeff,

Thanks for the thoughtful reply. As I said initially, we _are_ interested
in contributing (especially in the areas that are currently missing), but
we need to make sure that the benefits outweigh the cost of giving up
control over code we depend on.

> Parts of the core will not change any more, while other stuff is
> currently undergoing significant reorganization.

Which parts are now fixed, and which parts are in flux? I have read the
last couple of months' posts on the mailing list, but it would be great to
get the current status in one place.

> I know of several groups using Biopython with Jython.

Are any of them active on this list? I'd definitely be interested in
hearing any experiences people have had with this.

Is there anything major besides the parsing framework that depends on C
libraries? How difficult would it be to translate into Java the parts of
mxTextTools that Martel requires?

> It is hard to say how much participating in Biopython will benefit you.
> If you want someone to write your code, I'm fairly confident that's
> not going to happen.

That's not what we're looking for, but discussion, debugging, and other
support would definitely be useful.

We'll come up with a more concrete plan for how we want to organize our
code over the next couple of weeks, and post it to the list for
discussion. At worst, it's easy to make up dummy modules that just
translate between different naming conventions. I definitely do appreciate
the amount of work that's gone into the Biopython project, and recognize
that there are probably good reasons for a lot of the things that I don't
currently understand.
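
To make the dummy-module idea concrete, here is a minimal sketch of a wrapper that exposes one naming convention on top of another. Both class names and the method names are hypothetical, made up for illustration; they are not taken from Biopython or from our code, and the syntax is current Python.

```python
class SequenceStore:
    """Hypothetical class written in a mixedCase naming convention."""

    def getSequence(self, name):
        # Toy lookup so the sketch runs on its own.
        return 'ACGT' if name == 'demo' else None


class SequenceStoreAdapter:
    """Expose the wrapped object under underscore-style names."""

    # Map of new-style names to the names the wrapped object uses.
    _renames = {'get_sequence': 'getSequence'}

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, attr):
        # Only called when normal lookup fails, so _wrapped and
        # _renames are found without recursing into this method.
        target = self._renames.get(attr, attr)
        return getattr(self._wrapped, target)


store = SequenceStoreAdapter(SequenceStore())
print(store.get_sequence('demo'))
```

Names that are not in the rename table pass straight through, so the adapter only has to list the places where the two conventions differ.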

> However, I think it is unfair to attribute the brittleness in the
> documentation to bugs in the code.

One illustrative example:

Page 32 of the Tutorial describes how to set up an NCBIDictionary with the
default settings (for nucleotide sequences). We were trying to get some
protein sequences. When we passed in peptide accession numbers, the error
message indicated that the most likely problem was that the accession
numbers were not in the database. However, they were present when we
looked them up manually through NCBI's web site.

From the Tutorial:

>>> from Bio import GenBank
>>> ncbi_dict = GenBank.NCBIDictionary()
>>> print ncbi_dict['6273291']
LOCUS       AF191665                 902 bp    DNA     linear   PLN
DEFINITION  Opuntia marenae rpl16 gene; chloroplast gene for chloroplast

...many more lines: works fine.

>>> print ncbi_dict['AAN12123']
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.2/site-packages/Bio/GenBank/__init__.py", line 1541, in __getitem__
    raise KeyError, x
KeyError: ERROR, possibly because id not available?

>>> new_ncbi_dict = GenBank.NCBIDictionary(database='protein')
>>> print new_ncbi_dict['AAN12123']
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.2/site-packages/Bio/GenBank/__init__.py", line 1541, in __getitem__
    raise KeyError, x
KeyError: ERROR, possibly because id not available?

>>> new_ncbi_dict_2 = GenBank.NCBIDictionary(database='protein',
>>> print new_ncbi_dict_2['AAN12123']
LOCUS       AAN12123                 438 aa            linear   INV
DEFINITION  CG5605-PF [Drosophila melanogaster].
...many more lines: works fine

In other words, initializing GenBank.NCBIDictionary specifying only a
database does not work: the format must be specified as well, and the
docstring doesn't say what the valid formats are.

This is not fixed even in the CVS version. Looking at
Bio/GenBank/__init__.py in CVS:

[first line is 1470]

    def __init__(self, database='sequences', format="gb", delay=5.0,
        """NCBIDictionary([database][, delay][, parser])

        Create a new Dictionary to access GenBank.  Valid values for
        database are 'genome', 'nucleotide', 'protein', 'popset', and
        'sequences'.  delay is the number of seconds to wait between
        each query (5 default).  parser is an optional parser object
        to change the results into another form.  If unspecified, then
        the raw contents of the file will be returned.
        """

        from Bio.WWW import RequestLimiter
        self.parser = parser
        self.limiter = RequestLimiter(delay)
        self.database = database
        if format:
            self.format = format
        elif self.database == 'nucleotide':
            self.format = 'gb'
        elif self.database == 'protein' or self.database == 'popset':
            self.format = 'gp'
        else:
            self.format = 'native'

The branches that pick a format from the database are never reached,
because format defaults to 'gb' and the 'if format:' test therefore
always succeeds. The fix is trivial:

    def __init__(self, database='sequences', format=None, delay=5.0,

With this change a lookup always returns a result, but under the default
'sequences' database it comes back in native format, which breaks the
Tutorial example. To preserve the Tutorial example, set the default
database to 'nucleotide' instead of 'sequences'. Another option would be
to autodetect whether a particular accession number is protein or
nucleotide and return the matching gb or gp format, but this would
take somewhat more effort.
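
As a rough sketch of the combined change, here is the format selection pulled out into a standalone function. The name pick_format is hypothetical and used only for illustration (it is not part of Bio.GenBank), and the syntax is current Python rather than 2.2. The branch logic follows the constructor quoted above, with the two proposed defaults: format=None and database='nucleotide'.

```python
def pick_format(database='nucleotide', format=None):
    """Return the format the constructor would end up using.

    Hypothetical helper, not part of Biopython: it mirrors the
    if/elif chain in NCBIDictionary.__init__ with the proposed defaults.
    """
    if format:
        # An explicit format from the caller always wins.
        return format
    elif database == 'nucleotide':
        return 'gb'
    elif database == 'protein' or database == 'popset':
        return 'gp'
    else:
        # Any other database (e.g. 'sequences') falls back to native.
        return 'native'


print(pick_format())                          # gb
print(pick_format(database='protein'))        # gp
print(pick_format(database='sequences'))      # native
print(pick_format('protein', format='gb'))    # gb
```

With format="gb" as the default instead, the first branch would fire on every call and the three database-specific branches would be dead code, which is the behaviour we ran into.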

Our experiences trying to follow and modify the recipes in the Tutorial
suggest that mismatches like this between the documentation and the code
are fairly common. We will file bug reports if time permits, but it does
take significant effort to write them up and verify that the patches work
(especially given the state of the tests).

Also, we were surprised to find all this code lurking in __init__.py in
the first place. Is there a specific motivation for this design decision?

I think that the proposal of breaking up the Tutorial into sections might
help a lot with this sort of thing. It might also help to make the
specific examples in the Tutorial into unit tests that can be conveniently
run when the code is updated so that it's easy to see what breaks...
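
One way this could look, as a sketch only: Python's standard doctest module can run the >>> examples found in a block of text. The excerpt below is a hypothetical stand-in for one Tutorial section and uses a plain dict instead of GenBank.NCBIDictionary so that it runs without network access; run_section is likewise a made-up helper name, and the syntax is current Python.

```python
import doctest

# Hypothetical excerpt standing in for one Tutorial section. A real
# section would import Bio.GenBank; a dict keeps the sketch offline.
TUTORIAL_SECTION = """
Looking up a record by identifier:

>>> records = {'6273291': 'LOCUS AF191665'}
>>> print(records['6273291'])
LOCUS AF191665
>>> 'AAN12123' in records
False
"""


def run_section(text, name='tutorial_section'):
    """Run every >>> example in text; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    # Empty globals dict: the section must set up everything it uses.
    test = parser.get_doctest(text, {}, name, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted


failed, attempted = run_section(TUTORIAL_SECTION)
print('%d of %d examples failed' % (failed, attempted))
```

If each Tutorial section were kept as its own text file, a loop over those files calling something like run_section would show which recipes break after a code change.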

Anyway, I will keep you posted as our specific plans mature.

