[Biopython-dev] Martel timings

Brad Chapman chapmanb at arches.uga.edu
Sun Oct 29 11:35:56 EST 2000


[Martel thread]
I wrote:
> > Hmmm, one side note about RecordReader. I really like the way you can
> > interface with the parsers in multiple ways in the current Biopython
> > parsing. I think it is really useful to be able to iterate over a
> > record and get the record back, instead of automatically having to
> > parse it

Andrew:
> I've got experimental code for that at
> http://www.biopython.org/~dalke/SaxRecords.py
> 
> It uses a new thread for the callback object and a Queue to send
> parsed records back to the iterator interface in the originating
> thread. Currently loses memory if the thread doesn't go to completion
> because the new thread is sitting there waiting for the queue to empty.
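For readers following along, here is a rough sketch of the thread-plus-Queue idea described above. It is not the code in SaxRecords.py; the names ThreadedRecordIterator and parse_lines are made up for illustration, and the "parser" is just a stand-in that calls a callback once per line.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the record stream


class ThreadedRecordIterator:
    """Hypothetical adapter: turns a callback-driven parser into an iterator."""

    def __init__(self, parse, source, maxsize=1):
        self._queue = queue.Queue(maxsize=maxsize)
        self._finished = False
        # daemon=True so an abandoned worker cannot keep the process alive
        worker = threading.Thread(
            target=self._run, args=(parse, source), daemon=True
        )
        worker.start()

    def _run(self, parse, source):
        try:
            # The parser pushes each record into the queue via the callback.
            # put() blocks while the queue is full, i.e. until next() is called.
            parse(source, self._queue.put)
        finally:
            # Always signal the end, even if the parser raised.
            self._queue.put(_DONE)

    def __iter__(self):
        return self

    def __next__(self):
        if self._finished:
            raise StopIteration
        item = self._queue.get()
        if item is _DONE:
            self._finished = True
            raise StopIteration
        return item


def parse_lines(source, callback):
    """Stand-in for a callback-based parser: one 'record' per line."""
    for line in source:
        callback(line.strip())


records = list(ThreadedRecordIterator(parse_lines, ["rec1\n", "rec2\n"]))
```

If the consumer stops iterating early, the worker stays blocked in put() on the full queue, which is the leftover-thread problem mentioned in the quote.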

Hmmmmm, I admit I am having lots of problems grokking this -- I think
my mind must be really cloudy. I just can't exactly see why using
threads is the best way. The way that Biopython parsers work is:

1. Get a handle with the next record in a big file.

2. If a parser is passed, parse the handle and return the results.

   Otherwise (no parser), return the handle itself.
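The two steps above can be sketched roughly as follows. This is only an illustration and not the actual Biopython Iterator code: the class name RecordIterator, the "//" record terminator and the callable parser are all assumptions made for the example.

```python
import io


class RecordIterator:
    """Hypothetical iterator: yields one record per call, parsed or raw."""

    def __init__(self, handle, parser=None, terminator="//"):
        self._handle = handle
        self._parser = parser          # any callable taking a handle (assumed)
        self._terminator = terminator  # assumed end-of-record marker

    def __iter__(self):
        return self

    def __next__(self):
        # Step 1: collect the lines of the next record from the big file.
        lines = []
        for line in self._handle:
            lines.append(line)
            if line.rstrip() == self._terminator:
                break
        if not lines:
            raise StopIteration
        record_handle = io.StringIO("".join(lines))
        # Step 2: parse if a parser was given, otherwise hand back the handle.
        if self._parser is not None:
            return self._parser(record_handle)
        return record_handle


data = "ID 1\n//\nID 2\n//\n"
raw_records = [h.read() for h in RecordIterator(io.StringIO(data))]
parsed_ids = list(
    RecordIterator(io.StringIO(data), parser=lambda h: h.readline().split()[1])
)
```

No threads are needed here because the iterator itself decides where each record ends before any parsing happens.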

This seems to make more sense (i.e. simpler for my simple mind :-), but
I'm not sure -- what are your thoughts?

[helpful description of errors in Martel]
I wrote:
> > Do you think there is a way to make the RecordReader act similar to
> > the Iterators in this regard?

Andrew:
> So yes.  Convert the "fatalError" events to "error" and do recovery by
> skipping to the next record.  Then have the SaxRecords code, which
> does the Iterator-like interface, return the right information for
> problematical records.
> 
> Umm, what does the Iterator do for bad records?  It looks like it
> raises an exception, but allows you to call next() to get the next
> record?  That's reasonable to me (since I think I can support it :)

Yup, that's the way Iterator works, which would be very nice. It would 
be a serious pain to have a huge parse completely die near the end
because of a single bad record.

There is also the issue I was just discussing with Jeff about getting
back bad records and trying to find why they are bad (e.g. in BLAST
output, but I would imagine it might be helpful in other cases as well
-- badly formatted GenBank entries that the parser doesn't like?).
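A minimal sketch of that behaviour, under some assumptions: the raw text of a record is read off the handle before parsing, so a failed parse leaves the handle positioned at the next record, and the exception carries the raw text for later inspection. The names BadRecord, TolerantIterator and parse_id are hypothetical, as is the "//" terminator; none of this is the real Biopython or Martel API.

```python
import io


class BadRecord(Exception):
    """Hypothetical exception that keeps the text of the offending record."""

    def __init__(self, message, raw_text):
        super().__init__(message)
        self.raw_text = raw_text


class TolerantIterator:
    """Raises on a bad record, but next() can be called again afterwards."""

    def __init__(self, handle, parser, terminator="//"):
        self._handle = handle
        self._parser = parser          # callable taking the raw record text
        self._terminator = terminator  # assumed end-of-record marker

    def __iter__(self):
        return self

    def __next__(self):
        # Consume the whole record first, so the handle has already moved
        # past it by the time the parser gets a chance to fail.
        lines = []
        for line in self._handle:
            lines.append(line)
            if line.rstrip() == self._terminator:
                break
        if not lines:
            raise StopIteration
        raw = "".join(lines)
        try:
            return self._parser(raw)
        except ValueError as exc:
            raise BadRecord(str(exc), raw) from exc


def parse_id(raw):
    """Toy parser: the second token of the record must be an integer."""
    return int(raw.split()[1])


def collect(iterator):
    """Run a whole parse, setting bad records aside instead of dying."""
    good, bad = [], []
    while True:
        try:
            good.append(next(iterator))
        except StopIteration:
            break
        except BadRecord as exc:
            bad.append(exc.raw_text)
    return good, bad


handle = io.StringIO("ID 1\n//\nID x\n//\nID 3\n//\n")
good, bad = collect(TolerantIterator(handle, parse_id))
```

Here the middle record fails to parse, yet the records on either side still come through, and the bad one is kept as plain text so it can be examined afterwards.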
 
> I'll work on it; unless you want to do it?

I can try, although I'm not exactly positive about the best way to
proceed. This is related (at least in my mind) with the other problem
I was discussing with Jeff...

[cool new stuff in the RecordReader]
> Thanks!  If you didn't notice, it also plays some tricks to read ahead
> many lines, which should give better overall performance.  The
> File.UndoHandle isn't as tricky but has better guarantees of where
> it is in the file and it allows undos, which Martel doesn't need.  I
> bet changing the code to read ahead multiple lines would speed up the
> existing biopython code.
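To make the read-ahead idea concrete, here is a small sketch. It is not the RecordReader or File.UndoHandle code; the class name ReadAheadLines and the default sizehint are assumptions. It relies only on the standard readlines(hint) method of file-like objects to pull in a batch of lines at once.

```python
import io
from collections import deque


class ReadAheadLines:
    """Hypothetical wrapper that fetches many lines per underlying read."""

    def __init__(self, handle, sizehint=64 * 1024):
        self._handle = handle
        self._sizehint = sizehint  # approximate number of characters per batch
        self._buffer = deque()

    def readline(self):
        if not self._buffer:
            # One call pulls in a whole batch of complete lines.
            self._buffer.extend(self._handle.readlines(self._sizehint))
        if not self._buffer:
            return ""  # end of file, same convention as file.readline()
        return self._buffer.popleft()


reader = ReadAheadLines(io.StringIO("a\nb\nc\n"), sizehint=2)
lines = [reader.readline() for _ in range(4)]
```

The trade-off is the one noted in the quote: the underlying handle's position runs ahead of what the caller has seen, so this wrapper cannot say exactly where it is in the file.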

Yeah, this stuff is very cool. My mind is still kind of blown away by
both this and Jeff's File.UndoHandle stuff -- it is really nifty that
you can do so much cool stuff with the handles!

Brad



