[BioPython] Bug in Bio.GenBank.index_file()

Andrew Dalke dalke@acm.org
Tue, 15 May 2001 11:51:25 -0600


>we have come across a bug in the Bio.GenBank.index_file() function.
 ...
>The problem is that Bio.GenBank.index_file() directly accesses the
>positions member of a Martel.RecordReader, apparently assuming to find
>file positions of record starts there.

Ahhh.  I understand where the confusion might have arisen.  I wanted the
Martel code to be as lean as possible so I didn't keep track of
positions on the assumption that downstream code could keep a running
sum of the characters().  But as you point out, it *appears* to keep
track of record positions - which only fails when there is more than
SIZEHINT data - so especially given the lack of appropriate documentation,
people may assume that position is valid.
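
As a rough sketch of the "running sum of characters()" idea: the handler
below adds up the length of every characters() call and notes the total
whenever a record starts.  It uses the standard xml.sax parser on a tiny
XML string as a stand-in for Martel's event stream; the "record" element
name and the OffsetCounter class are made up for illustration and are not
part of Martel or Bio.GenBank.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class OffsetCounter(ContentHandler):
    """Track record start offsets by summing the text seen so far."""

    def __init__(self, record_tag="record"):  # tag name is an assumption
        super().__init__()
        self.record_tag = record_tag
        self.offset = 0            # characters delivered so far
        self.record_starts = []    # offset at each record start

    def startElement(self, name, attrs):
        if name == self.record_tag:
            self.record_starts.append(self.offset)

    def characters(self, content):
        # The parser may split text into several calls; the sum is the same.
        self.offset += len(content)


doc = ("<file><record>LOCUS A\n//\n</record>"
       "<record>LOCUS B\n//\n</record></file>")
handler = OffsetCounter()
xml.sax.parseString(doc.encode("ascii"), handler)
print(handler.record_starts, handler.offset)
```

The offsets only line up with file positions if every input character
reaches characters(), which is the assumption the running-sum approach
depends on.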

>I would offer a fix for the bug, but I am not sure how to do this,

>The direct access to the positions list of an instance of a class
>belonging to another module is a hack that bypasses modularity.

Yes and no.  In C++ that may be true, but Python has __getattr__
which can be used to make attribute lookups implementation
independent.
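
A minimal sketch of that point, using a hypothetical Reader class (not
Martel's RecordReader): the storage is kept under a private name, and
__getattr__ answers lookups for "positions", so the internals can change
without breaking client code that reads the attribute.

```python
class Reader:
    """Hypothetical reader used only to illustrate __getattr__."""

    def __init__(self):
        # Internal storage; free to be renamed or restructured later.
        self._record_offsets = []

    def add_record(self, offset):
        self._record_offsets.append(offset)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name == "positions":
            return list(self._record_offsets)
        raise AttributeError(name)


reader = Reader()
reader.add_record(0)
reader.add_record(1024)
print(reader.positions)
```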

>The fact
>that this hack was committed probably indicates need for an additional
>interface of the Martel.RecordReader class

Really what I would like to have is full support for SAX Location
information, which is supposed to allow client code full ability
to get the current position of any event - with line and offset
in line resolution.  I didn't do that because tracking line numbers
is a more complicated and expensive task, so I punted and used the
"clients must count characters()" solution.

I would prefer record events to be identical to other events,
and have the use of RecordReader only be an optimization.  This
means if there is a way to get the character position from the
record-level event there needs to be a way to get it for all events.
It is useful data, but there are some people who don't need it.
My concern is how much of a performance hit it is.  OTOH, there
are a few performance tricks I haven't added, which should more
than balance the overhead.

And it does seem needed.

How should the information be passed back to the client?  I was
thinking it could be passed as attrs in the startElement method,
but that doesn't help because you would like to know when it
ends, and endElement doesn't pass that information.

The SAX way to track locations is to pass a Locator object to
the DocumentHandler before parsing.  The handler can then call the
Locator's getLineNumber() or getColumnNumber(), which return -1
if the operation is not supported.
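
For reference, here is roughly how the Locator mechanism looks with
Python's standard xml.sax module, where the handler class is ContentHandler
rather than the SAX1 DocumentHandler.  This is a sketch against the stdlib
parser, not Martel; the LocatingHandler name is made up.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class LocatingHandler(ContentHandler):
    """Record where each element starts, as reported by the Locator."""

    def __init__(self):
        super().__init__()
        self.locator = None
        self.seen = []

    def setDocumentLocator(self, locator):
        # The parser hands over the Locator before any other event.
        self.locator = locator

    def startElement(self, name, attrs):
        self.seen.append((name,
                          self.locator.getLineNumber(),
                          self.locator.getColumnNumber()))


doc = b"<file>\n<record>A</record>\n<record>B</record>\n</file>"
handler = LocatingHandler()
xml.sax.parseString(doc, handler)
for name, line, column in handler.seen:
    print(name, line, column)
```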

I could extend that to provide a getCharacterPosition() and
have that be updated before every event.  I don't like the expense
of needing a function call (setCharacterPosition) for every event.
OTOH, I could bend the API even more and initialize the locator
to know how to get the correct information as needed rather than
being told via a method call.
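
One way the "pull" variant could look, purely as a sketch: the locator
keeps a reference to the parser's state and reads the offset only when
asked, so nothing has to be called per event.  getCharacterPosition() is
the proposed method, not an existing SAX call, and ParserState and
PositionLocator are hypothetical names.

```python
from xml.sax.xmlreader import Locator


class ParserState:
    """Hypothetical stand-in for whatever the parser tracks internally."""

    def __init__(self):
        self.bytes_consumed = 0


class PositionLocator(Locator):
    """Locator that looks up the position on demand."""

    def __init__(self, state):
        self._state = state

    def getCharacterPosition(self):
        # Proposed extension; computed lazily from the shared state.
        return self._state.bytes_consumed


state = ParserState()
locator = PositionLocator(state)
state.bytes_consumed = 4096          # parser advances; locator is not told
print(locator.getCharacterPosition())
print(locator.getLineNumber())       # base class reports -1 (unsupported)
```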

A nice thing about the Locator is if the DocumentHandler wasn't
told to use one it could potentially be optimized to not track
byte positions.

I don't like this.  My XML book ("The XML Companion") says that
the Locator is updated "each time it finds an error" which implies
that it may not be used for all elements.  That could be a
misinterpretation or a change between SAX1, which the book covers,
and SAX2, which Martel uses.

The SAX2 doc at
http://www.megginson.com/SAX/Java/javadoc/org/xml/sax/Locator.html
says the locations are tied to any SAX event, and the SAX1 doc says
the same thing, so I'm going to assume my book wasn't precise enough
for me.

Okay, so does a modified Locator object with special support for
getting the byte position work for everyone?

The byte position will be returned as a long integer because Martel
should be able to work on 32 bit systems compiled for LONG_LONG filepos
support.  Though I have no way to test that.

>But while this should be fairly easy
>to write for Unix & Co. only, I'm concerned that newline substitutions
>done in other operating systems might become an additional source of
>discrepancies between chunk and file coordinates.

Only open files in binary mode.  Martel understands all three common
newline conventions and doesn't do any character conversion downstream,
so in binary mode all input characters are passed to the characters()
call, making the count always correct.
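
A small illustration of why binary mode matters, using only plain file
I/O (no Martel): with "\r\n" line endings, an offset counted on the raw
bytes can be passed straight to seek().  The file contents are made up.

```python
import os
import tempfile

data = b"LOCUS A\r\n//\r\nLOCUS B\r\n//\r\n"

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(data)

    # Binary mode: no newline translation, so every byte is counted.
    with open(path, "rb") as f:
        raw = f.read()
    second = raw.index(b"LOCUS B")   # offset of the second record

    # The same offset is a valid file position.
    with open(path, "rb") as f:
        f.seek(second)
        line = f.readline()
finally:
    os.remove(path)

print(second, line)
```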

                    Andrew
                    dalke@acm.org