[Biopython-dev] Martel timings

Andrew Dalke dalke at acm.org
Fri Oct 13 01:54:46 EDT 2000


Brad:
> Hmmm, one side note about RecordReader. I really like the way you can
> interface with the parsers in multiple ways in the current Biopython
> parsing. I think it is really useful to be able to iterate over a record
> and get the record back, instead of automatically having to parse it

I've got experimental code for that at
http://www.biopython.org/~dalke/SaxRecords.py

It uses a new thread for the callback object and a Queue to send parsed
records back to the iterator interface in the originating thread.
It currently leaks memory if the thread doesn't run to completion, because
the new thread just sits there waiting for the queue to empty.
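
For illustration, here is a minimal sketch of the thread-plus-Queue idea in
present-day Python. It is not the code in SaxRecords.py; iterate_callbacks,
run_parser and fake_parser are made-up names, and the bounded queue is an
assumption chosen to show where the producer thread can get stuck.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the record stream


def iterate_callbacks(run_parser, maxsize=1):
    """Turn a callback-driven parser into an iterator.

    run_parser(callback) is a hypothetical function that calls
    callback(record) once per parsed record and then returns.
    """
    q = queue.Queue(maxsize=maxsize)

    def worker():
        try:
            run_parser(q.put)  # blocks while the queue is full
        finally:
            q.put(_DONE)       # always tell the consumer we are finished

    # daemon=True so a stuck producer cannot keep the process alive
    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item


def fake_parser(callback):
    # Stand-in for a real parser driving SAX-style callbacks.
    for name in ("rec1", "rec2", "rec3"):
        callback(name)


records = list(iterate_callbacks(fake_parser))
```

If the consumer stops iterating early, the worker stays blocked in q.put(),
which is the kind of leftover thread described above.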

> (I find this useful for pulling a "bad" record out of a big file of
> records). 

That's a slightly different topic.  Currently all errors are "fatalError"s,
which under the SAX spec means the parser must stop.  However, SAX
also supports "error"s, which are recoverable.  (Of course, the error
handler can raise an exception, which causes a dead stop in the parser.)

Huh, there are some bugs in the record parser code:

            elif isinstance(result, saxlib.SAXException):
                # Wrong format
                self.err_handler.fatalError(result)
                return
            else:
                # did not reach end of string
                pos = filepos + result
                self.err_handler.fatalError(StateTableEOFException(pos))

That last branch should do a "return" to meet the spec, and as I learned
yesterday, both need to send an "endDocument" event after the fatalError.
And I do need to fix the following to give some sort of error event.

            record = reader.next()  # XXX what if an exception is raised?
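
A rough sketch of the intended behaviour, using the handler classes from
Python's standard xml.sax package rather than Martel's own code. The
parse_records and fail functions, the logging handlers and broken_reader are
hypothetical names; the point is only that a failure while fetching a record
becomes a fatalError event, followed by endDocument and a return.

```python
from xml.sax import SAXException
from xml.sax.handler import ContentHandler, ErrorHandler


class LogContent(ContentHandler):
    """Content handler that records the events it receives."""
    def __init__(self, log):
        super().__init__()
        self.log = log

    def startDocument(self):
        self.log.append("startDocument")

    def characters(self, content):
        self.log.append("record:" + content)

    def endDocument(self):
        self.log.append("endDocument")


class LogErrors(ErrorHandler):
    """Error handler that records fatal errors instead of raising."""
    def __init__(self, log):
        self.log = log

    def fatalError(self, exception):
        self.log.append("fatalError")


def fail(cont_handler, err_handler, exc):
    # Report the fatal error, then close the document.
    err_handler.fatalError(exc)
    cont_handler.endDocument()


def parse_records(reader, cont_handler, err_handler):
    """reader is any iterator of records (a stand-in for the record reader)."""
    cont_handler.startDocument()
    while True:
        try:
            record = next(reader)
        except StopIteration:
            break
        except Exception as exc:
            # An exception from the reader becomes an error event.
            fail(cont_handler, err_handler, SAXException(str(exc), exc))
            return  # stop parsing after a fatal error
        cont_handler.characters(record)
    cont_handler.endDocument()


def broken_reader():
    yield "rec1"
    raise IOError("truncated file")


log = []
parse_records(broken_reader(), LogContent(log), LogErrors(log))
```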

> Do you think there is a way to make the RecordReader act similar to
> the Iterators in this regard?

So yes.  Convert the "fatalError" events to "error" and do recovery by
skipping to the next record.  Then have the SaxRecords code, which does the
Iterator-like interface, return the right information for problematical
records.

Umm, what does the Iterator do for bad records?  It looks like it raises
an exception, but allows you to call next() to get the next record?
That's reasonable to me (since I think I can support it :)
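
As a sketch of that behaviour (an assumption about what the interface could
look like, not the existing Iterator or SaxRecords code): a plain generator
cannot be resumed after it raises, so this uses a small class whose
__next__ can fail for one record and still hand out the following one.
RecordIterator and parse_int are hypothetical names.

```python
class RecordIterator:
    """Iterator that raises on a bad record but can be resumed with next()."""

    def __init__(self, chunks, parse):
        self._chunks = iter(chunks)  # raw text of each record
        self._parse = parse          # hypothetical per-record parse function

    def __iter__(self):
        return self

    def __next__(self):
        chunk = next(self._chunks)   # StopIteration here ends the iteration
        # The chunk has already been consumed, so if parsing raises,
        # the next call simply moves on to the following record.
        return self._parse(chunk)


def parse_int(text):
    return int(text)


it = RecordIterator(["1", "oops", "3"], parse_int)
results = []
while True:
    try:
        results.append(next(it))
    except StopIteration:
        break
    except ValueError:
        results.append(None)  # note the bad record and keep going
```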

I'll work on it; unless you want to do it?

> BTW, I like the StartsWith, EndsWith in the new RecordReader! When I was
> doing the FASTA stuff I couldn't figure out any way to recognize new
> files with only the EndsWith behavior :-).

Thanks!  If you didn't notice, it also plays some tricks to read ahead many
lines, which should give better overall performance.  The File.UndoHandle
isn't as tricky but has better guarantees of where it is in the file and
it allows undos, which Martel doesn't need.  I bet changing the code to
read ahead multiple lines would speed up the existing biopython code.
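
A minimal sketch of the read-ahead idea, assuming a StartsWith-style record
boundary. This is not Martel's RecordReader; read_records and its parameters
are illustrative. It pulls many lines per call with readlines(sizehint)
instead of one readline() per line.

```python
import io


def read_records(handle, starts_with, sizehint=64 * 1024):
    """Yield records as lists of lines.

    A new record begins at each line that starts with `starts_with`.
    Lines are fetched in batches of roughly `sizehint` characters.
    Any lines before the first marker are yielded as a record too.
    """
    current = []
    while True:
        lines = handle.readlines(sizehint)  # a batch of whole lines
        if not lines:
            break
        for line in lines:
            if line.startswith(starts_with) and current:
                yield current
                current = []
            current.append(line)
    if current:
        yield current


sample = io.StringIO(">a\nAC\n>b\nGT\n")
fasta_records = list(read_records(sample, ">", sizehint=4))
```

Records that straddle two batches are handled because the partial record is
carried over in `current` between readlines() calls.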

>>  pruning the expression tree 

> reduce the size of the XML generated and returned.

Good point - I hadn't even thought about how it would affect XML output.  I
was more concerned about reducing function call overhead.

                    Andrew
                    dalke at acm.org




