[Bioperl-l] Remote blast fork errors / Process limit restrictions

Robert Bradbury robert.bradbury at gmail.com
Mon Dec 7 20:41:54 UTC 2009


This comment could also have the subject line: "Why does Bioperl/get_sequence
fork at all?  Why are not all operations sequential?"  And if this is a
"default" mode that I'm unaware of -- how do I ever write a reliable BioPerl
script if I have little or no control over what resources the program uses
when it runs?  I may have days available, so I can bear the burden of
relatively slow results (and so can use sequential processing rather than
parallel).

I've got a perl script that uses remote blast to blast a sequence against a
subset of the NCBI sequences.  It "mostly" works, in that it returns a
seemingly complete .bls result file but when attempting to look at the
sequences (so it can more accurately summarize the information from the
results than a standard blast report allows) it terminates prematurely with
errors.

The error is:
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Couldn't fork: Resource temporarily unavailable
STACK: Error::throw
STACK: Bio::Root::Root::throw
/usr/lib/perl5/vendor_perl/5.8.8/Bio/Root/Root.pm:368
STACK: Bio::DB::WebDBSeqI::_open_pipe
/usr/lib/perl5/vendor_perl/5.8.8/Bio/DB/WebDBSeqI.pm:722
STACK: Bio::DB::WebDBSeqI::get_seq_stream
/usr/lib/perl5/vendor_perl/5.8.8/Bio/DB/WebDBSeqI.pm:463
STACK: Bio::DB::NCBIHelper::get_Stream_by_acc
/usr/lib/perl5/vendor_perl/5.8.8/Bio/DB/NCBIHelper.pm:479
STACK: Bio::DB::WebDBSeqI::get_Seq_by_acc
/usr/lib/perl5/vendor_perl/5.8.8/Bio/DB/WebDBSeqI.pm:186
STACK: Bio::Perl::get_sequence
/usr/lib/perl5/vendor_perl/5.8.8/Bio/Perl.pm:520
STACK: main::acc_2_desc /home/bradbury/Genomes/bin/RB.pl:182
STACK: /home/bradbury/Genomes/bin/RB.pl:155
-----------------------------------------------------------

The precise line (in my code) which appears to be generating the error is:
    $seq = get_sequence('GenBank', $accsn);

Now this could be a problem if NCBI/GenBank fails due to load conditions,
but this specific failure is repeatable and is most likely due to hitting
the per-user process limit.  Small blast results work fine -- it's only
when the blast has returned several hundred hits that it runs into this
problem.

Now what it sounds like to me is an attempt to do multiple asynchronous NCBI
queries (to get a sequence) with complete disregard for the environment
(process limits, NCBI limits, etc.).  But I do not know enough about how
this works to point a finger at a specific function.  As a result,
get_sequence results are accumulated, summarized, etc. without the
respective wait() variant calls ever having been issued to collect former
children.  [This IMO would clearly be a bug.]
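
To illustrate what I mean by collecting former children, here is a minimal
sketch in plain Perl (no BioPerl involved).  The helper name run_in_child
and the empty work routine are made up for the example; I am not claiming
this is what the library does internally.  The point is only that a child
which has exited still occupies a process slot until the parent waits for
it, so waiting after each fork keeps the count at one.

```perl
use strict;
use warnings;

# Hypothetical helper: run one unit of work in a child process and wait
# for it before returning, so at most one child exists at any time.
sub run_in_child {
    my ($work) = @_;
    my $pid = fork();
    die "Couldn't fork: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: do the work, then exit.
        $work->();
        exit 0;
    }
    # Parent: block until this child has exited.  This removes it from
    # the process table, so it no longer counts against the user limit.
    waitpid($pid, 0);
    return $? >> 8;    # the child's exit status
}

# Several hundred forks in a row, one per "hit"; the work is a stand-in.
my @statuses = map { run_in_child(sub { }) } 1 .. 300;
```

Without the waitpid() line, each finished child would presumably linger as
a zombie until the script exits, which is the kind of accumulation that a
few hundred hits would turn into a fork failure.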

This could be addressed by allowing the BioPerl library to run in 3 modes.
 (1) completely synchronous -- if you fork, you wait until the child is done
and you collect it; if any fork fails, then one either collects the
outstanding processes or switches to the non-conservative mode.
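
The "if any fork fails, collect the process" part could look something like
the following.  Again this is only a sketch in plain Perl under my own
assumptions: careful_fork, reap_finished and the %children bookkeeping are
names invented for the example, not anything that exists in BioPerl.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG EAGAIN);

my %children;    # pids of children that have not been collected yet

# Collect every child that has already exited, without blocking.
sub reap_finished {
    my $count = 0;
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        delete $children{$pid};
        $count++;
    }
    return $count;
}

# Hypothetical wrapper: fork, and if the process limit is hit (EAGAIN,
# "Resource temporarily unavailable"), collect children and try again.
sub careful_fork {
    my ($max_tries) = @_;
    for my $try (1 .. $max_tries) {
        my $pid = fork();
        if (defined $pid) {
            $children{$pid} = 1 if $pid;    # parent remembers the child
            return $pid;
        }
        die "Couldn't fork: $!" unless $! == EAGAIN;
        next if reap_finished();            # freed a slot, retry at once
        if (%children) {
            my $done = waitpid(-1, 0);      # block until one child exits
            delete $children{$done} if $done > 0;
        }
        else {
            sleep 1;                        # nothing of ours to collect
        }
    }
    die "Couldn't fork after $max_tries attempts";
}

for my $i (1 .. 50) {
    my $pid = careful_fork(5);
    exit 0 if $pid == 0;    # child: the real work would go here
}

# Collect whatever is still outstanding before moving on.
while (%children) {
    my $done = waitpid(-1, 0);
    last if $done <= 0;
    delete $children{$done};
}
```

Retrying only on EAGAIN is deliberate: any other fork error is not going to
be fixed by collecting children, so it is better to fail loudly.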

Robert


