[BioPython] 1.00a2 bug report (+ kludge fix)

Jan T. Kim kim@inb.mu-luebeck.de
Wed, 1 Aug 2001 17:27:09 +0200


Hi all,

earlier this week, I installed biopython-1.00a2... and as a first result,
a student reported that the Bio.GenBank.NCBIDictionary has developed
a habit of eating the first five lines off each record. This bug is
quite severe with respect to BioPython's usability / functionality,
but fortunately, it should be rather simple to fix it: Investigating
this matter revealed the following:

    * Bio.GenBank.NCBIDictionary.__getitem__() uses Bio.WWW.NCBI.query()

    * Bio.WWW.NCBI.query() uses Bio.WWW.NCBI._open()

    * Bio.WWW.NCBI._open() obtains a handle for an HTTP response, turns
      that into an UndoHandle and gets the first 5 lines of the response
      to check for an error (Bio/WWW/NCBI.py, line 131). The 5 lines
      read are fed back into the UndoHandle using the saveline() method.
      (See bottom of this message for a remark on this.)

    * The saveline() method adds those line to the private member list
      _saved of the UndoHandle instance. However, the read() method
      just ignores them, so they are not retrieved even though they have
      been put back.

The problem is with the UndoHandle::read() method:

    def read(self, size=-1):
        saved = ''
        while size > 0 and self._saved:
            if len(self._saved[0]) <= size:
                size = size - len(self._saved[0])
                saved = saved + self._saved.pop(0)
            else:
                saved = saved + self._saved[0][:size]
                self._saved[0] = self._saved[0][size:]
                size = 0
        return saved + self._handle.read(size)

The body of the while loop is never executed (unless the optional second
argument is specified, which is not the case in the context outlined
above). The problem appears to be that a size of -1 is supposed to mean
"infinity", but that does not actually come out in the implementation
of the read() method.

As a kludge, I modified the method as follows:

    def read(self, size=-1):
        saved = ''
	if size == -1 :
	    saved_size = 2147483647
	else :
	    saved_size = size
        while saved_size > 0 and self._saved:
            if len(self._saved[0]) <= saved_size:
                saved_size = saved_size - len(self._saved[0])
                saved = saved + self._saved.pop(0)
            else:
                saved = saved + self._saved[0][:saved_size]
                self._saved[0] = self._saved[0][saved_size:]
                saved_size = 0
	# print 'saved:', saved
        return saved + self._handle.read(size)

This hack repairs the script which was broken by upgrading to 1.00a2, but
I hope it's needless to say that this is not a proper fix. I just post
this as a proof of my bug analysis.

As a remark regarding distinction of error messages versus GenBank records
in the HTTP response, it might be a better method to look at the MIME type
of the response. This is text/plain in case of a successful record retrieval,
but text/html if errors are reported. This might serve as as a basis for
additional / improved methods for error detection.

Greetinx, Jan
-- 
 +- Jan T. Kim -------------------------------------------------------+
 |  *NEW* -->  email: kim@inb.mu-luebeck.de                           |
 |  *NEW* -->  WWW:   http://www.inb.mu-luebeck.de/staff/kim.html     |
 *-----=<  hierarchical systems are for files, not for humans  >=-----*