[Biojava-l] TokenParser.TPStreamParser
David Huen
David Huen <smh1008@cus.cam.ac.uk>
Sun, 10 Jun 2001 16:07:54 +0100 (BST)
On Sun, 10 Jun 2001, Thomas Down wrote:
> On Sun, Jun 10, 2001 at 02:40:33PM +0100, David Huen wrote:
> > The above appears to fubar when fed sequences with whitespace.
> > Unfortunately, these are common with XML derived sequences. Would anyone
> > object to a modification such that whitespace characters are ignored
> > rather than worthy of an exception?
>
> As I recall, I wrote TPStreamParser to be compatible with
> the existing TokenParser. I'd actually be kind-of reluctant
> to add whitespace ignoring at this level, because it effectively
> means that you can /never/ use whitespace characters as tokens
> (which is probably a very bad idea, but it still worries me a
> little to completely rule it out).
>
OK.
> How about the following alternative strategy:
>
> I presume you're talking about driving a StreamParser from a
> SAX or StAX event source. The S[t]AX listener will receive
> arrays of characters. You can then identify blocks of
> non-whitespace within this array, and pass them to the
> StreamParser.characters(char[], int, int) method. No
> need to copy the characters into another array or anything,
> so it should be quite efficient.
OK, I'll do that. That's no problem.
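For the archive, a minimal sketch of that approach. It assumes the third argument of characters(char[], int, int) is a length, as in SAX. CharSink and WhitespaceSplitter are hypothetical names used only for illustration; CharSink stands in for the StreamParser so the example compiles on its own.

```java
final class WhitespaceSplitter {
    /** Hypothetical stand-in for StreamParser.characters(char[], int, int). */
    interface CharSink {
        void characters(char[] data, int start, int length);
    }

    /**
     * Scans ch[start .. start + length - 1] and forwards each maximal run of
     * non-whitespace characters to the sink. Nothing is copied: the sink
     * receives the original array plus an offset and a length.
     */
    static void forwardNonWhitespace(char[] ch, int start, int length, CharSink sink) {
        int end = start + length; // exclusive upper bound (assumes length semantics)
        int i = start;
        while (i < end) {
            // Skip any whitespace before the next run.
            while (i < end && Character.isWhitespace(ch[i])) {
                i++;
            }
            int runStart = i;
            // Advance to the end of the non-whitespace run.
            while (i < end && !Character.isWhitespace(ch[i])) {
                i++;
            }
            if (i > runStart) {
                sink.characters(ch, runStart, i - runStart);
            }
        }
    }
}
```

A S[t]AX handler would call forwardNonWhitespace(ch, start, length, sink) from its own characters() callback, with the sink delegating to the real parser.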
I have encountered another problem with StaxContentHandlerBase.
There is a method defined in the API:-
public void characters(char[] ch, int start, int end)
When given elements with lots of data, it is usually called with start = 0
and end = 16384. Any attempt to access ch[16384] results in an
immediate exception, which suggests to me that end is really a length rather
than the index of the highest element within ch[]. Is that correct?
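To illustrate the two readings, a small sketch. It assumes the buffer is exactly 16384 chars long, which I have not verified. CharactersArgs and both method names are hypothetical. SAX's ContentHandler.characters(char[] ch, int start, int length) uses the length reading; whether StaxContentHandlerBase does the same is the open question. With start = 0, a length and an exclusive end index are indistinguishable, so this only rules out an inclusive end.

```java
final class CharactersArgs {
    /** Reads the slice treating the third argument as a length (the SAX convention). */
    static int visitAsLength(char[] ch, int start, int length) {
        int visited = 0;
        for (int i = start; i < start + length; i++) {
            char c = ch[i]; // last index touched is start + length - 1
            visited++;
        }
        return visited;
    }

    /** Reads the slice treating the third argument as the index of the last element. */
    static int visitAsInclusiveEnd(char[] ch, int start, int end) {
        int visited = 0;
        for (int i = start; i <= end; i++) {
            char c = ch[i]; // touches ch[end]; out of bounds when end == ch.length
            visited++;
        }
        return visited;
    }
}
```

With a 16384-element array, visitAsLength(buf, 0, 16384) visits every element, while visitAsInclusiveEnd(buf, 0, 16384) throws ArrayIndexOutOfBoundsException at index 16384, which matches what I am seeing.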
Thanks,
David Huen, Dept. of Genetics, Univ. of Cambridge