[Biopython-dev] Background process handling
Andrew Dalke
dalke at acm.org
Fri Sep 29 06:20:35 EDT 2000
Thomas Sicheritz-Ponten <thomas at cbs.dtu.dk>:
>How should I solve this ?
>a) fork and exec*
>b) popen
>c) write to temporary file, start blast into new file, continuously read
> new file
>d) use an expect module
>e) threads
>f) a combination with a LOT of updates ?
>g) ???
There are two usual approaches - select based and thread based.
"select" is a mechanism to tell if something happened on a file handle.
Under unix, nearly everything is a file handle (files, network I/O, X).
Under Windows it only works with sockets. See the select module.
Using selects in a command line application works something like this.
Have a central list of "jobs", each of which is a select'able object.
(Warning: you are limited to the number of file descriptors on a
machine, which also includes stdin, stdout and stderr. On some machines
this may be 64 or lower, though lower is quite rare these days.)
The outermost loop of your program does a select on the task list, to
see which had changes. From this it maps the activity information to
an action, which is most likely a callback for that object. The function
can read text from the descriptor, remove the task from the list of
tasks, or whatever.
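A minimal sketch of the select loop described above, for Unix only (select does not work on pipes under Windows). The names Job, handle_read and run_all are made up for illustration, and the modern subprocess module stands in for os.popen or fork/exec. Each job is a select'able object because it has a fileno() method, and handle_read plays the role of the per-object callback.

```python
import os
import select
import subprocess
import sys


class Job:
    """One running command whose stdout can be watched with select."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
        self.chunks = []

    def fileno(self):
        # select.select accepts any object that has a fileno() method.
        return self.proc.stdout.fileno()

    def handle_read(self):
        """Callback: read what is available; return False once finished."""
        data = os.read(self.fileno(), 4096)
        if data:
            self.chunks.append(data)
            return True
        # An empty read means end of file, so the command has finished.
        self.proc.stdout.close()
        self.proc.wait()
        return False

    def output(self):
        return b"".join(self.chunks)


def run_all(jobs):
    """Outermost loop: select on the pending jobs and dispatch callbacks."""
    pending = list(jobs)
    while pending:
        readable, _, _ = select.select(pending, [], [])
        for job in readable:
            if not job.handle_read():
                pending.remove(job)  # drop the finished job from the task list


if __name__ == "__main__":
    demo = [Job([sys.executable, "-c", "print('hello')"])]
    run_all(demo)
    print(demo[0].output())
```

A real program would do other work in the same loop (for example by passing a timeout to select.select) rather than blocking until every job is done.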
With a GUI things become a bit more complicated. Some GUIs want to
be the main event loop, but realize that other people use select based
multitasking, so provide a way to register file descriptors and callbacks.
Other GUIs act more like a library, and give you a way to get a file
descriptor (or possibly a list of them) for the GUI, which you use in
your own event loop.
I believe Tk is of the first form, but I've never really looked into it.
The GUI documentation should go into the details.
The select approach can be used with a) os.popen, b) fork/exec (see the
popen2 module for one way) and c) reading a file using the regular open.
Actually, b) is used as the basis for both a) and the system call you
need for c).
I've never used d) so cannot comment.
If you really want to get into select based systems, take a look
at Sam Rushing's Medusa, part of which is included in Python as
asyncore and asynchat. The Design Pattern for this approach is,
I believe, called the "Reactor."
The other usual approach, and the one often considered more modern,
is to use threads. This is more or less what you must do if you want to
run under MS Windows. Threads are to select as preemptive multitasking
is to non-preemptive multitasking.
The mechanism for threads is conceptually simpler than select: "start
this function and let it do whatever it needs to do while I work on
other things." Likely you will want to create a thread task object
which takes the BLAST input parameters and runs blast. The thread
will use the same methods as select (os.popen, fork/exec, etc.) but
instead of using select to tell if the status changed, it just sits
there waiting for input. It can do this since the thread library will
run other threads to prevent the program from completely halting.
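As a rough illustration of such a thread task object, here is a sketch that runs an arbitrary command rather than BLAST. The class name CommandTask is hypothetical, and it uses the subprocess module (which postdates this message) in place of os.popen. The thread simply blocks in communicate() until the child exits, and other threads keep running in the meantime.

```python
import subprocess
import sys
import threading


class CommandTask(threading.Thread):
    """Run one external command in its own thread and keep its output."""

    def __init__(self, argv):
        threading.Thread.__init__(self)
        self.argv = argv
        self.output = None
        self.returncode = None

    def run(self):
        # This blocks, but only this thread; the rest of the program goes on.
        proc = subprocess.Popen(self.argv, stdout=subprocess.PIPE)
        self.output, _ = proc.communicate()
        self.returncode = proc.returncode


if __name__ == "__main__":
    task = CommandTask([sys.executable, "-c", "print('finished')"])
    task.start()
    # ... the main thread is free to do other work here ...
    task.join()
    print(task.returncode, task.output)
```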
The downside of threads used to be that most application code, its
libraries and even POSIX calls weren't all thread safe. POSIX added
some new functions (the "*_r" ones) to fix the problems, and many
libraries are thread safe. Still, some aren't and so things like
Tk must be dealt with specially to keep all the Tk calls in a single
thread.
That doesn't prevent you from writing non-thread safe code, or using
libraries (like biopython?) which aren't thread safe. You start having
to worry about how to serialize library calls so that you don't trigger
problems. Hint: use the higher level primitives for threading, like
Queue.
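One possible way to serialize calls with a queue, sketched under the assumption that the non-thread safe library can be wrapped in a single function. SerialCaller is a hypothetical helper, not part of any library: one thread owns the function, and every other thread submits its arguments through a queue and waits for the reply on a second, private queue.

```python
import queue
import threading


class SerialCaller(threading.Thread):
    """Run every call to func in this one thread, one call at a time."""

    def __init__(self, func):
        threading.Thread.__init__(self)
        self.func = func
        self.requests = queue.Queue()

    def run(self):
        while True:
            item = self.requests.get()
            if item is None:  # sentinel put by stop()
                break
            args, reply = item
            try:
                reply.put((True, self.func(*args)))
            except Exception as exc:
                # Hand the error back so the caller does not wait forever.
                reply.put((False, exc))

    def call(self, *args):
        """Called from any thread; blocks until the owner thread answers."""
        reply = queue.Queue(1)
        self.requests.put((args, reply))
        ok, value = reply.get()
        if not ok:
            raise value
        return value

    def stop(self):
        self.requests.put(None)
```

The queue does all the locking, so the calling code never touches a mutex directly.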
Debugging becomes more complicated because if there are timing problems,
such as those caused by non-thread safe libraries, you can't always get a
reproducible test case. I tend to write my threaded objects with very
state-machine-like behaviour so that I can make good guarantees about when
and how they should be used. (This is a good programming style in general.)
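To make the state machine idea concrete, here is one possible shape for such an object. The class name Task and the particular states are assumptions for the sake of the example. Every transition happens under a lock, and methods called in the wrong state fail loudly instead of returning half-finished data.

```python
import threading


class Task:
    """A threaded task that may only move NEW -> RUNNING -> DONE or FAILED."""

    NEW, RUNNING, DONE, FAILED = "new", "running", "done", "failed"

    def __init__(self, func):
        self._func = func
        self._lock = threading.Lock()
        self._finished = threading.Event()
        self._state = self.NEW
        self._result = None

    def start(self):
        with self._lock:
            if self._state != self.NEW:
                raise RuntimeError("cannot start a task in state %r" % self._state)
            self._state = self.RUNNING
        threading.Thread(target=self._run).start()

    def _run(self):
        try:
            result, state = self._func(), self.DONE
        except Exception as exc:
            result, state = exc, self.FAILED
        with self._lock:
            self._result, self._state = result, state
        self._finished.set()

    def wait(self):
        self._finished.wait()

    def state(self):
        with self._lock:
            return self._state

    def result(self):
        with self._lock:
            if self._state != self.DONE:
                raise RuntimeError("no result in state %r" % self._state)
            return self._result
```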
Also, Python's core is only thread safe at the coarse grained level.
There is a single, global interpreter lock which prevents two pieces
of Python code from running at the same time. The lock is released
every so often to allow multiple threads to work. However, this is not
a problem for you since you aren't interested in threads as a way to
increase compute performance.
It used to be that there were a lot of timing problems because the
thread libraries were buggy, but those problems have largely been fixed.
Given all of this, I suggest using threads. It's an easier programming
model (even given the possible non-thread safe parts), works on Unix
and MS Windows, and there are now more people with thread development
experience than with select. It looks like Antoine is one to ask :)
Here's a sketch of one way to write your code using threads. It assumes
all GUI events are serialized in one thread, which is the main one.

import os
import threading
from queue import Queue  # "from Queue import Queue" on Python 2

class BlastWindow:
    def __init__(self, gui_change):
        self.gui_change = gui_change
        self.result = None
    def set_results(self, result):
        # using the caller's thread, not the GUI thread, so set the
        # data but don't do anything using the GUI until called later
        self._result = result
        self.gui_change.put(self)
    def do_change(self):
        # the BLAST run is finished, so get the result data and use it to
        # update the window
        self.result = self._result
        del self._result
        # change GUI ...

class BlastTask(threading.Thread):
    def __init__(self, blast_params, window):
        threading.Thread.__init__(self)
        self.window = window
        ...
    def run(self):
        # set up the tmpdir and files like .ncbirc, etc.
        os.system("cd tmpdir; blast -i ..")
        # no error checking for now
        self.window.set_results(blast_parse(open("tmpdir/blast.output")))

gui_change = Queue(-1)  # used to serialize GUI updates
app = App(gui_change)
window = app.createBlastWindow()
...
blast = BlastTask(blast_params, window)
...
while 1:
    change = gui_change.get()
    if change == <exit>:  # however you define an "exit"
        break
    change.do_change()

Andrew
dalke at acm.org