[Bioperl-l] Remote blast fork errors / Process limit restrictions

Robert Bradbury robert.bradbury at gmail.com
Thu Dec 17 19:42:54 UTC 2009


Just to close out the issue of BioPerl forking without collecting its
children.  This shows up in particular on accesses to external databases
through get_sequence, which go through the individual database sub-modules.

As it turns out, the code does do an explicit fork, apparently so that the
child process can read from the database while the parent process
manipulates the data as it becomes available.  Now, one could argue that a
threaded model might be better, since threads are fairly standard OS
tools in current environments.

But I couldn't find any functions which actually wait for the forked
processes (presumably because they are created for "future" use).  Nor is
there any indication in the documentation I've found (which is spread
across the web) or on the Wiki that "creating child processes" is how these
functions work and that one *needs* to collect those children after each
use, or else zombie processes will accumulate, which on "reasonable"
systems with per-user process limits will eventually cause later forks to
fail.  Nor (it would appear) does the parent process set up a SIGCHLD
"catcher" which could collect the processes once they exit (which I expect
in the case of "get_sequence" would be after closing the socket which
actually fetched the sequence from GenBank).
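
For what it's worth, a SIGCHLD "catcher" of the kind I mean could look
roughly like the sketch below.  This is not BioPerl code: fetch_in_child
is a made-up stand-in for any library call that forks and never waits,
and the counter is only there to show that the children were collected.

```perl
use strict;
use warnings;
use POSIX ":sys_wait_h";   # provides WNOHANG

my $reaped = 0;

# Reap every child that has already exited.  One SIGCHLD may stand for
# several exits, so loop until waitpid has nothing left to collect.
$SIG{CHLD} = sub {
    local ($!, $?);   # do not clobber the main program's status variables
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        $reaped++;
    }
};

# Hypothetical stand-in for a library call that forks and never waits.
sub fetch_in_child {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    exit 0 if $pid == 0;   # child: pretend the fetch finished
    return $pid;
}

fetch_in_child() for 1 .. 5;

# Give the children time to exit; sleep may be cut short by the signal.
my $deadline = time + 5;
sleep 1 while $reaped < 5 && time < $deadline;
print "reaped $reaped children\n";
```

With a handler like this installed once at startup, the caller would not
need to add a wait after each library call.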

It can be resolved easily enough by adding a call after each use of these
functions:
   use POSIX ":sys_wait_h";       # once, at the top: imports WNOHANG
   $kid = waitpid(-1, WNOHANG);   # after each call that forks
But typically, as a programmer, I should not be responsible for having to
clean up the leftovers of library calls (unless said cleanup requirements
are clearly documented).
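
One caveat with a single non-blocking waitpid: it collects at most one
child, and returns 0 if the child has not exited yet.  A slightly more
thorough version, as a sketch only, loops until nothing is left and does
one blocking pass before the program exits.  Again, fetch_in_child and
reap_children are names I made up for illustration, not BioPerl
functions.

```perl
use strict;
use warnings;
use POSIX ":sys_wait_h";   # provides WNOHANG

# Collect every child that has exited so far, without blocking.
# Returns the number of children reaped on this call.
sub reap_children {
    my $count = 0;
    while ((my $kid = waitpid(-1, WNOHANG)) > 0) {
        $count++;
    }
    return $count;
}

# Hypothetical stand-in for a library call that forks internally.
sub fetch_in_child {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    exit 0 if $pid == 0;   # child: pretend the fetch finished
    return $pid;
}

my $total = 0;
for my $i (1 .. 5) {
    fetch_in_child();
    $total += reap_children();   # may be 0 if the child is still running
}

# Before exiting, block until any stragglers are gone.
while (waitpid(-1, 0) > 0) {
    $total++;
}
print "collected $total children\n";
```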


But to a "newbie" using the functions, coming from a procedural background
(C), not an OO background (which at least I would tend to view as a wart on
the otherwise robust Perl language), there are two problems:
1. The lack of documentation and examples explaining how the functions work
and how they must be handled at a higher level (by executing explicit wait
system calls).
2. The lack of code in the BioPerl functions to deal with the forked
processes which they create.  Procedural programmers have a perspective --
if you create it -- you have to clean it up.  It would appear that in the
transition to OO programming (or perhaps simply for expediency) that detail
was left out of the documentation, the code, or both.  From this
standpoint one could view garbage collectors as being fundamentally evil --
because they gloss over the fact that programmers should know what they are
doing and when they are doing it.

So, everywhere in the documentation where there is a get_sequence call (or
anything else which accesses an external database and causes a fork to
occur) the example should be modified as I have outlined above -- or else
the code should be corrected so that exited children are always collected
and zombies are not allowed to accumulate.
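
As a sketch of what the second option might look like (this is my own
illustration, not how the BioPerl modules are actually written), Perl's
forking form of open gives the "create it, clean it up" behaviour almost
for free, because close() on the piped handle waits for the child.  The
routine name fetch_lines and the ids are placeholders.

```perl
use strict;
use warnings;

# Hypothetical library routine: fork a reader, hand its output to the
# caller, and collect the child before returning.
sub fetch_lines {
    my @ids = @_;
    my $pid = open(my $from_child, "-|");   # implicit fork; child's STDOUT is the pipe
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: stands in for reading from the remote database.
        print "record for $_\n" for @ids;
        exit 0;
    }

    # Parent: consume the data as it arrives.
    chomp(my @lines = <$from_child>);

    # close() on a piped handle waits for the child, so no zombie remains.
    close($from_child) or warn "child exited with status $?\n";
    return @lines;
}

my @records = fetch_lines(qw(id1 id2));
print "$_\n" for @records;

# Nothing is left to wait for: waitpid reports -1 (no child processes).
my $left = waitpid(-1, 0);
print "leftover children: ", ($left == -1 ? "none" : $left), "\n";
```

The caller never sees a pid and has nothing to clean up, which is the
behaviour I would expect from a library call.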


