[Bioperl-l] Remote blast fork errors / Process limit restrictions

Chris Fields cjfields at illinois.edu
Thu Dec 17 20:25:56 UTC 2009


Robert,

I have previously outlined specifically why you are seeing the fork issue, and a possible solution.  IIRC it primarily has to do with you trying to do something more advanced using the (very basic) Bio::Perl procedural interface, something along the lines of pulling a sequence and using RemoteBlast.  Retrieving a sequence from a remote database is a forked process on most OSes (I think Windows is the sole exception) and occurs internally in Bio::Perl via Bio::DB::GenBank.  In the meantime, one option is to set up your own pipeline: use Bio::DB::GenBank (set to use temp files) to fetch the sequence, then pass it to Bio::Tools::Run::RemoteBlast or Bio::Perl.

Trying to catch signals can be notoriously flaky across platforms and Perl versions; I recall running into problems with Cygwin and OS X.  We can modify Bio::Perl to use a temp file instead, which avoids forks altogether and is probably the best long-term solution.
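
For reference, here is a minimal sketch of reaping exited children explicitly, without installing a signal handler, along the lines of the waitpid call quoted below.  It assumes a Unix-like system with a real fork.  The names spawn_unreaped_child and reap_children are hypothetical, used only for illustration; they are not part of BioPerl, and the first one merely stands in for any library call that forks and never waits.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# Hypothetical stand-in for a library call that forks a child
# and returns without ever waiting for it.
sub spawn_unreaped_child {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: exit at once, skipping END blocks and destructors.
        POSIX::_exit(0);
    }
    return $pid;    # parent: hand back the child's PID
}

# Collect every child that has already exited, without blocking.
# A single waitpid(-1, WNOHANG) reaps at most one child, so loop
# until it reports nothing further to collect (0 or -1).
sub reap_children {
    my $reaped = 0;
    while ((my $kid = waitpid(-1, WNOHANG)) > 0) {
        $reaped++;
    }
    return $reaped;
}
```

Calling reap_children() after each remote fetch should keep zombies from piling up against a per-user process limit.  Because the call does not block, a child that has not exited yet is simply picked up on a later call.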

My last bit: I don't usually say this, primarily b/c it's misconstrued by some, but 'patches are always welcome'.  What doesn't work is just telling us to arbitrarily change code w/o indicating exactly where to do so.  The tone you use, which comes off a tad condescending, can be abrasive and may not garner any response (or at least will get you one you don't expect).  Please keep that in mind.

chris

On Dec 17, 2009, at 1:42 PM, Robert Bradbury wrote:

> Just to close out the issue of bioperl forking (in particular accesses to
> external databases through get_sequence) which involves individual database
> sub-modules and not collecting its children.
> 
> As it turns out, the code does do an explicit fork, apparently so that the
> child process can read from the database while the parent process
> manipulates the data as it becomes available.  Now, one could argue that a
> threaded model might be better, since threads are now fairly standard OS
> tools in current environments.
> 
> But I couldn't find any functions which actually wait for the forked process
> (presumably because they are created for "future" use).  Nor is there any
> indication in most of the documentation I've found (which is spread across
> the web) or the Wiki that "creating child processes" is how these functions
> work and that one *needs* to collect those children after each use, or else
> zombie processes will accumulate, which on "reasonable" systems with
> per-user process limits will create problems for proper program
> functioning.  Nor (it would appear) does the parent process set up a
> SIGCHLD "catcher" which could collect the processes once they exit (which I
> expect in the case of "get_sequence" would be after closing the socket
> which actually fetched the sequence from GenBank).
> 
> It can be resolved easily enough by adding a call after each use of these
> functions:
>   use POSIX qw(WNOHANG);   # once, at the top of the script
>   $kid = waitpid(-1, WNOHANG);
> But typically, as a programmer, I should not be responsible for having to
> clean up the leftovers of library calls (unless said cleanup requirements
> are clearly documented).
> 
> 
> But to a "newbie" using the functions, coming from a procedural background
> (C), not an OO background (which at least I would tend to view as a wart on
> the otherwise robust Perl language), there are two problems:
> 1. The lack of documentation and examples explaining how the functions work
> and how they must be handled at a higher level (by executing explicit wait
> system calls).
> 2. The lack of code in the BioPerl functions to deal with the forked
> processes which they create.  Procedural programmers have a perspective:
> if you create it, you have to clean it up.  It would appear that in the
> transition to OO programming (or perhaps simply for expediency) that detail
> was left out of the documentation, the code, or both.  From this
> standpoint one could view garbage collectors as being fundamentally evil,
> because they gloss over the fact that programmers should know what they are
> doing and when they are doing it.
> 
> So, everywhere in the documentation where there is a get_sequence call (or
> anything which accesses an external database which causes a fork to occur)
> there should be a modification as I have outlined above -- or else the code
> should be corrected so orphaned children are always collected and not
> allowed to accumulate.