[Biopython-dev] Martel timings

Andrew Dalke dalke at acm.org
Mon Oct 30 02:58:13 EST 2000


Brad:
>Hmmmmm, I admit I am having lots of problems grokking this -- I think
>my mind must be really cloudy. I just can't exactly see why using
>threads is the best way.

It's the solution for a somewhat different problem.  Suppose you have
an arbitrary SAX interface, where you cannot change the event
generation code, and want to turn it into an iterator interface.

One way to implement it is to store the events in a list and then, after
the callbacks are finished, scan the list to produce the records.  The
drawback is that every event must be stored before any processing
happens, so memory use can become a problem on large inputs.
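As a sketch of that first approach (the handler and helper names here are illustrative, not Martel's actual API): buffer every callback as a tuple, then scan the buffer for record boundaries afterwards.

```python
class BufferingHandler:
    """Collects (event, tag) tuples instead of processing them."""
    def __init__(self):
        self.events = []
    def startElement(self, tag):
        self.events.append(("start", tag))
    def endElement(self, tag):
        self.events.append(("end", tag))

def records_from_events(events, record_tag):
    """Scan the buffered events and yield one event list per record."""
    current = None
    for event in events:
        kind, tag = event
        if kind == "start" and tag == record_tag:
            current = [event]
        elif current is not None:
            current.append(event)
            if kind == "end" and tag == record_tag:
                yield current
                current = None

# Example: feed two records' worth of events through the handler
handler = BufferingHandler()
for kind, tag in [("start", "record"), ("start", "ID"), ("end", "ID"),
                  ("end", "record"), ("start", "record"), ("end", "record")]:
    if kind == "start":
        handler.startElement(tag)
    else:
        handler.endElement(tag)
records = list(records_from_events(handler.events, "record"))
```

The memory problem is visible here: `handler.events` must hold the entire input's events before the scan can begin.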

Another way is to spawn off a new thread and do the processing there.
When a record is processed, send it over to the original thread.  (I
believe this would work even better using Stackless Python.)  This is
the most general but is (as you noticed) more complex.
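The thread approach can be sketched like this, assuming only that the parser is some push-style engine we cannot modify; the worker thread runs it and hands finished records back over a bounded queue (all names here are hypothetical):

```python
import queue
import threading

_DONE = object()  # sentinel marking end of input

def iter_records(run_parser):
    """run_parser(emit) calls emit(record) for each record it finds;
    this turns that push interface into a pull-style iterator."""
    q = queue.Queue(maxsize=10)   # bounded, so memory stays limited

    def worker():
        run_parser(q.put)
        q.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        record = q.get()
        if record is _DONE:
            break
        yield record

# Example push-style "parser" that emits three records:
def fake_parser(emit):
    for i in range(3):
        emit("record-%d" % i)

records = list(iter_records(fake_parser))
```

The bounded queue is what keeps this general: the producer thread blocks once ten records are waiting, so memory stays flat no matter how large the input is.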

I said "somewhat different problem" because we have control over the
Martel definitions.  There's already a specialization (RecordParser)
which has better memory usage for record-oriented data.  By definition,
that means it can be used to convert all of a single record's callback
events into a list of events, as in the first approach, and then scan
the list to create records.

So what I've done is add a new method to the expression objects called
"make_iterator", parallel to the existing "make_parser" method.  The
make_iterator method takes a string, which is the tag name used at the
start and end of each record.  The returned object supports parse(...),
parseString(...) and parseFile(...) just like the parser object
returned from "make_parser", except each method also takes a second
parameter, a factory used to make records.  That description is easier
to understand as code:

   iterator = format.make_iterator("swissprot38_record")
   for record in iterator.parseString(text, make_biopython_record):
       ...

The implementation uses an EventStream protocol.  An EventStream has a
'.next()' method, which returns a list of events, or None if there are
no events left.  In the standard case, the EventStream converts all of
the input into a single list of events and returns it.  For a record
reader, each call to .next() reads one record and returns its events.

The EventStream object is passed to the Iterator class's constructor,
which acts as a forward iterator for reading records (the 'for record
in ...' part of the above).  When *its* .next() is called, it starts
scanning the list of available events, calling the EventStream whenever
more events are needed.  As it scans the list, it looks for the start
and end tags.  Everything inside those tags is passed to the SAX parser
object created by the factory object passed in (the
'make_biopython_record').  It also sends startDocument/endDocument
events.  The Iterator's next() method returns the created SAX parser
objects.
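The scanning loop above can be sketched as follows.  This is not Martel's implementation, just an illustration of the described behavior: pull events from anything with an EventStream-style .next(), find the record tags, and replay the events in between into a fresh handler from the factory, bracketed by startDocument/endDocument.

```python
class ListEventStream:
    """Stub EventStream: hands back pre-built batches of events."""
    def __init__(self, batches):
        self._batches = iter(batches)
    def next(self):
        return next(self._batches, None)

class CollectingHandler:
    """Stand-in for a SAX content handler built by the factory."""
    def __init__(self):
        self.seen = []
    def startDocument(self):
        self.seen.append("startDocument")
    def endDocument(self):
        self.seen.append("endDocument")
    def handle(self, event):
        self.seen.append(event)

class RecordIterator:
    def __init__(self, stream, record_tag, factory):
        self.stream, self.tag, self.factory = stream, record_tag, factory
        self.pending = []   # events fetched but not yet consumed
    def next(self):
        handler = None
        while True:
            if not self.pending:
                batch = self.stream.next()
                if batch is None:
                    return None              # no more records
                self.pending.extend(batch)
            event = self.pending.pop(0)
            kind, tag = event
            if handler is None:
                if kind == "start" and tag == self.tag:
                    handler = self.factory() # new record begins
                    handler.startDocument()
            elif kind == "end" and tag == self.tag:
                handler.endDocument()
                return handler               # one finished record
            else:
                handler.handle(event)        # event inside the record

stream = ListEventStream([[("start", "rec"), ("start", "ID")],
                          [("end", "ID"), ("end", "rec")]])
it = RecordIterator(stream, "rec", CollectingHandler)
record = it.next()
```

Note the record can span multiple EventStream batches; the `pending` list is what bridges them.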

Again, it's easier to use than describe.

This approach, BTW, is vaguely similar to Paul Prescod's pulldom.

The nice thing about the "make_iterator" API is that it supports both
this event stream approach and, when there's no way to modify the
parser code, the thread-based approach as well.

>Andrew:
>> Umm, what does the Iterator do for bad records?  It looks like it
>> raises
>> an exception, but allows you to call next() to get the next record?
>> That's reasonable to me (since I think I can support it :)
>
>Yup, that's the way Iterator works, which would be very nice. It would
>be a serious pain to have a huge parse completely die near the end
>because of a single bad record.

After reflection, I've come to a different conclusion about how to handle
bad records.  It's really easy to make a new format which handles swissprot
records as well as errors.

format = ParseRecords(swissprot38.format |
                      Rep(Group("bad_record",
                                Re("^((?!//)[^\n]*\n)*//\n"))),
                      EndsWith("//\n"))

(I don't have the source code available now, so the syntax is probably
a bit off.)

Then the SAX parser for records just needs to know how to handle both
swissprot38_record and bad_record records.  I like this because I like
strict code, where you have to tell it explicitly how to ignore errors.
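Such a handler might dispatch on the record tag like this.  This is just a sketch: parse_swissprot is a stand-in for the real record parser, and the dispatch function is hypothetical, but it shows how bad records get collected instead of killing the run.

```python
def parse_swissprot(text):
    """Stand-in for the real swissprot38 record parser."""
    return {"raw": text}

def handle_record(tag, text, errors):
    """Dispatch on the record tag: known records are parsed,
    bad_record spans are collected so the parse keeps going."""
    if tag == "swissprot38_record":
        return parse_swissprot(text)
    if tag == "bad_record":
        errors.append(text)   # remember it, don't die
        return None
    raise ValueError("unexpected record tag: %r" % tag)

errors = []
good = handle_record("swissprot38_record", "ID   TEST\n//\n", errors)
bad = handle_record("bad_record", "garbage\n//\n", errors)
```

The strictness lives in the final raise: any tag the handler wasn't explicitly told about is an error, rather than being silently skipped.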

Plus, if you want to do some recovery with data extraction, the handler
could switch to a different, less strict syntax (like
'(?P<name>..)   (?P<text>[^\n]*)\n').

>> I'll work on it; unless you want to do it?
>
>I can try, although I'm not exactly positive about the best way to
>proceed.

I've got the iterator code mostly working.  I'm doing documentation and
adding more regression tests.  How about when I finish I send you a
version to test out?  Don't ask when :(

                    Andrew
                    dalke at acm.org




