[Bioperl-l] Remote BLAST returning lots of 500 time out errors

Barry Moore barry.moore at genetics.utah.edu
Sat Feb 7 23:24:48 EST 2004


I have a script that uses Bio::Tools::Run::RemoteBlast to BLAST the 
translations of all ORFs from a mRNA transcript against the database. It 
works fine, except that if I run several sequences at once, after about 
50 ORFs worth of BLASTing, NCBI starts to return errors (500 read 
time-out) for every job submitted. I can't figure out what's going on 
here. Any ideas?

The script is kind of long and it take several minutes to get to the 
errors, but if anyone wants to try to recreate the error I've attached 
the code below. Some of you will probably recognize bits of your code 
that I've pilfered from various Bioperl docs. I'm running Bioperl 1.4, 
ActiveState perl 5.8.0.805 on Windows XP. I get the error by running:

perl ORF_BLAST1.pl –min_length 150 NM_001112 NM_007327 NM_015833 NM_021569

Barry Moore
Dept. Human Genetics
University of Utah

------------------------------------------------------------------------------------------------------

#!/usr/bin/perl

#ORF_BLAST1.pl
#See end of file for POD documentation

use strict;
use warnings;
use GD;
use Getopt::Long;
use Bio::SeqIO;
use Bio::PrimarySeq;
use Bio::DB::GenBank;
use Bio::Tools::Run::RemoteBlast;

#Give documentation when requested, or when missing command line arguments.
if ( ! $ARGV[0] || $ARGV[0] =~ /^-{1,2}(h|help|\?)$/i ) {
system ( "perldoc", $0 ) and die "For usage, use perldoc $0\n";
exit( 0 );
}

#Declare module level variables.
my $in_filename; #This variable holds the filename of the input file.
my $out_filename; #Suprisingly this variable holds the filname of the 
output file.
my $format; #This variable defines the output format (png or jpg).
my $min_length; #This variable defines the minimum ORF length to plot.
my $require_start; #This boolean varible identifies of an ORF must begin 
with a start.
my $seqio; #This variable holds the SeqIO object.

#Handle command line options.
GetOptions (
'in_file:s' => \$in_filename,
'out_file:s' => \$out_filename,
'format:s' => \$format,
'min_length:i' => \$min_length,
'start!' => \$require_start
);
my @accession = @ARGV;

#Set defaults.
$format ||= 'jpg';
$min_length ||= 150;
$require_start ||= 0;

#Define new SeqIO object. Take input from a file first if a
#filename has been specified. Otherwise take input from accession 
numbers off
#the command line (but don't try to do both)
if ($in_filename) {
$seqio = Bio::SeqIO->new(-format=>'fasta', -file=>$in_filename) or die 
"could not create Bio::SeqIO";
}
elsif (@accession) {
my $gb = new Bio::DB::GenBank();
$seqio = $gb->get_Stream_by_id(\@accession);
}

#Remote-BLAST factory object creation and blast-parameter initialization
my $BLAST_factory = Bio::Tools::Run::RemoteBlast->new('-prog' => 'blastp',
'-data' => 'nr',
'-expect' => '10',
'-readmethod' => 'SearchIO' );

#Main program loop to loop over all sequences input.
while (my $seq_obj = $seqio->next_seq) {
#Assign sequence specfic variables
my @starts = ();
my @stops = ();
my @orfs = ();
my $out_filename;
my $sequence = $seq_obj->seq;
my $sequence_length = length($sequence);
my $header = $seq_obj->display_name."|".$seq_obj->desc;

#Internal loop to find starts, stops, and ORFs
for my $count1 (0 .. $sequence_length) {
my $open = 0;
if ($count1 < 4) {$open = 1}
my $count2 = $count1;
my $codon = substr($sequence, $count1, 3);
my $frame = $count1 % 3; #Get the modulus of $count1/3 for ascertaining 
the frame
#Convert the modulus above into frame 1, 2, or 3.
if ($frame == 1) {$frame = 2}
elsif ($frame == 2) {$frame = 3}
elsif ($frame == 0) {$frame = 1}
#Add starts to stack.
if ($codon =~ /ATG/i) {
push @starts, {start => $count1,
frame => $frame};
if ($require_start == 1) {$open = 1} #Open the ORF flag.
}
#Add stops to stack.
if ($codon =~ /TGA|TAG|TAA/i) {
push @stops, {stop => $count1,
frame => $frame};
if ($require_start == 0) {$open = 1} #Open the ORF flag.
}
#Find extend of ORF if one has been opened by either of the above
#conditionals.
if ($open == 1) {
$codon = "";
my $count2 = $count1;
#Loop to step forward through ORF looking for next in frame stop.
while (($codon !~ /TGA|TAG|TAA/i) and ($count2 < $sequence_length - 4)) {
$count2 = $count2 + 3; #Keep it in frame.
$codon = substr($sequence, $count2, 3);
}
#Make sure the ORF is long enough to count...
if ($count2 - $count1 >= $min_length) {
push @orfs, {begin => $count1,
end => $count2,
frame => $frame
}; #...then push it onto the ORF stack.
}
}
}

@orfs || die "No ORFs of $min_length nucleotide in length found";

#Loop to BLAST each ORF against the database, and check for a hit.
my $BLAST_count;
for my $orf (@orfs) {
#Assign ORF specific variables.
my $begin = $$orf{begin};
my $end = $$orf{end};
my $frame = $$orf{frame};
$BLAST_count++;

#Initialize subsequence as new sequence
my $seq = new Bio::PrimarySeq
(-seq => $seq_obj->subseq($begin + 1, $end),
-display_id => "${frame}_${begin}_${end}");
#Translate sequence
my $trans = $seq->translate();

#Blast the sequence against a database:
my $job = $BLAST_factory->submit_blast($trans);
print STDERR "Blasting ORF ",$BLAST_count," of ", scalar @orfs, "...";
#Loop to load the RIDs returned for the BLAST job submitted (this probably
#doesn't need to be a loop here but I won't take it out yet)
while ( my @rids = $BLAST_factory->each_rid ) {
#Loop iterate over RIDs, and hit NCBI's BLAST server for a result
foreach my $rid ( @rids ) {
#Hit the server for a result on RID.
my $blast_results = $BLAST_factory->retrieve_blast($rid);
#Was a result returned?
if( !ref($blast_results) ) {
#If so and it returned an error remove that RID from the stack
if ($blast_results < 0) {
$BLAST_factory->remove_rid($rid);
}
print STDERR "."; #Keep the user staring at the dots.
sleep 5; #Plays nice with the servers.
}
#If a result was returned and it isn't an error, then pass it to a
#variable...
else {
my $result = $blast_results->next_result();
$BLAST_factory->remove_rid($rid); #...and remove it's RID from the stack.
#Check the result for a hit...
my $hit = $result->next_hit;
if (ref($hit)) {
my $hsp = $hit->next_hsp;
#...collect it's evalue from the hsp object, and add to the ORFs hash
$$orf{evalue} = $hsp->evalue();
}
#If no evalue found, default to 100 to keep undef from looking like a
#significant e-value.
else {$$orf{evalue} = 100}
print "\n";
}
}
}
}

#Main block to draw image.
my $image = new GD::Image(900, 150); #Create a new image.

#Allocate some colors.
my %color = (
white => $image->colorAllocate(255,255,255),
aqua => $image->colorAllocate(0,255,255),
black => $image->colorAllocate(0,0,0),
blue => $image->colorAllocate(0,0,255),
gray => $image->colorAllocate(128,128,128),
fuchsia => $image->colorAllocate(255,0,255),
green => $image->colorAllocate(0,255,0),
lime => $image->colorAllocate(0,255,255),
maroon => $image->colorAllocate(128,0,0),
navy => $image->colorAllocate(0,0,128),
olive => $image->colorAllocate(128,128,0),
purple => $image->colorAllocate(128,0,128),
red => $image->colorAllocate(255,0,0),
silver => $image->colorAllocate(192,192,192),
teal => $image->colorAllocate(0,128,128),
yellow1 => $image->colorAllocate(255,255,0),
yellow2 => $image->colorAllocate(200,200,0),
yellow3 => $image->colorAllocate(150,150,0)
);

#Make the background transparent and interlaced.
$image->transparent($color{white});
$image->interlaced('true');
#Put a black frame around the picture.
$image->rectangle(0,0,899,149,$color{black});
#Add the title line.
$image->string(gdGiantFont,10,10,$header,$color{black});
#Draw the lines for each frame.
$image->line(10,50,890,50,$color{black});
$image->line(10,75,890,75,$color{black});
$image->line(10,100,890,100,$color{black});
#Draw a line for the ruler.
$image->line(10,125,890,125,$color{black});

#Loop to add ruler ticks and numbers to image.
for my $tick (0 .. 10) {
#Convert sequence coordniates to image X-asix values.
$tick = $sequence_length/10*$tick;
$tick = convert($tick, $sequence_length);
#Add ruler ticks.
$image->line($tick,125,$tick,130,$color{black});
#Add nubmers to ruler.
$image->string(gdSmallFont,$tick-(2*length($tick-15)),130,$tick-15,$color{black});
}

#Loop to add ORFs to image.
for my $orf (@orfs) {
my $top; #The variable sets the top of ORF rectangle.
my $bottom; #This varibale sets the bottom of ORF rectangle.
#Asign the Y coordinates for the ORF to place them in the correct frame.
if ($$orf{frame} == 1) {$top = 40; $bottom = 60}
elsif ($$orf{frame} == 2) {$top = 65; $bottom = 85}
elsif ($$orf{frame} == 3) {$top = 90; $bottom = 110}
#Convert sequence coordniates to image X-axis values.
my $begin = convert($$orf{begin}, $sequence_length);
my $end = convert($$orf{end}, $sequence_length);
#Asign a shade of yellow to the ORF if the BLAST on that ORF returned an
#evaule.
my $orf_color = $color{black}; #Default ORF color to black.
if (defined $$orf{evalue}) {
if ($$orf{evalue} <= 10) {$orf_color = $color{yellow3}} #Dark yellow
if ($$orf{evalue} <= 1.0e-3) {$orf_color = $color{yellow2}} #Meduim yellow
if ($$orf{evalue} < 1.0e-25) {$orf_color = $color{yellow1}} #Bright yellow
}
#Draw rectangles for the ORFs.
$image->filledRectangle($begin,$top,$end,$bottom,$orf_color);
#Print the e-value on the ORF if it is below 10.
if ($$orf{evalue} < 10) {
$image->string(gdSmallFont,$begin + 3,$top + 2,$$orf{evalue},$color{black});
}
}

#Add green ticks to the image for each start.
for my $start (@starts) {
my $top; #This variable sets the top of the start line.
my $bottom; #This varibale sets the bottom of the start line.
#Assign the Y coordinates for the start line to put it in the correct frame.
if ($$start{frame} == 1) {$top = 50; $bottom = 60}
elsif ($$start{frame} == 2) {$top = 75; $bottom = 85}
elsif ($$start{frame} == 3) {$top = 100; $bottom = 110}
#Convert sequence coordniates to image X-axis values.
my $location = convert($$start{start}, $sequence_length);
#Draw the start ticks.
$image->line($location,$top,$location,$bottom,$color{green});
}

#Add red ticks to the image for each stop.
for my $stop (@stops) {
my $top; #This variable sets the top of the stop line.
my $bottom; #This varibale sets the bottom of the stop line.
#Assign the Y coordinates for the stop line to put it in the correct frame.
if ($$stop{frame} == 1) {$top = 40; $bottom = 60}
elsif ($$stop{frame} == 2) {$top = 65; $bottom = 85}
elsif ($$stop{frame} == 3) {$top = 90; $bottom = 110}
#Convert sequence coordniates to image X-asix values.
my $location = convert($$stop{stop}, $sequence_length);
#Draw the stop ticks.
$image->line($location,$top,$location,$bottom,$color{red});
}

#Set a default output filename if none was set on the command line.
if (! $out_filename) {
if ( $in_filename &&=~ /(.*?)\..*/) {
$out_filename = $1.".".$format;
}
elsif ( $seq_obj->primary_id !~ /unknown/) {
$out_filename = $seq_obj->display_name().".".$format;
}
else {
$out_filename = $seq_obj->primary_id().".".$format;
}
}
#Open a filehandle for output and make sure we are writing a binary stream.
open (OUT, ">$out_filename");
binmode OUT;

#Write the image to a file in specified format.
if ($format =~ /jpg|jpeg/) {
print OUT $image->jpeg;
}
if ($format =~ /png/) {
print OUT $image->png;
}
close OUT;
}

#A subroutine to convert sequence coordinates to x-axis values on the image.
sub convert {
my ($value, $length) = @_;
$value = (($value/$length)*870)+15; #Convert a sequence length value to 
an X-axis value.
$value = sprintf("%.0f", $value); #Round $value to nearest integer.
return $value;
}

=head1 NAME

ORF_BLAST1.pl

=head1 SYNOPSIS

perl ORF_Plot.pl [--options] NM_007327

=head1 DESCRIPTION

This program will take a sequence file as input, and generate a 
graphical output
of it's ORF architecture in 3 frames plotting ORFs, start codons (ATG) 
and stop
codons. It will BLAST the translation of each ORF against NCBI, and 
color the
shades of yellow to black, depending on the e-value returned for that ORF.

INPUT:

Input can be a list of space seperated accession numbers on the command 
line,
or a fasta file.

OUTPUT:

Output is a figure saved as either a png or jpg file to the current 
directory.

OPTIONS:

Several options can be specified, but all are optional.

--in_file filename
Use to set the input file name. The file that contains the input
sequences in fasta format.
--out_file filename
Use to set the output file name. Defaults to input file name, then
Bioperl's display name (usually the accession number), then Bioperl's
accession number (usually the gi number).
--min_size integer
Use to set the minimum ORF size that will be plotted in the figure.
--start
Use to require plotted ORFs to begin with a start
--format
Use to set the output format. Valid values are png or jpg (or jpeg).
Defaults to jpg.

=head1 USING

perl ORF_BLAST1.pl --min_size 300 --start --format png NM_001112 
NM_007327 NM_015833

or

perl ORF_BLAST1.pl --in_file sequence.fasta --out_file image_file 
--min_size 300 --start

=head1 REQUIRES

GD
Getopt::Long
Bio::SeqIO
Bio::PrimarySeq
Bio::DB::GenBank
Bio::Tools::Run::RemoteBlast

=head1 AUTHOR

Barry Moore
Department of Human Genetics
University of Utah
Salt Lake City, UT 84112
USA

Address bug reports and comments to: barry.moore at genetics.utah.edu

=head1 BUGS

Currently after about 50 ORFs BLASTed, NCBI starts to return time-out 
errors.

=head1 FUTURE DIRECTIONS

Add command line options for the BLAST parameters.

=head1 COPYRIGHT

Copyright 2003, Barry Moore. All rights reserved.

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

=head1 SEE ALSO



=cut





More information about the Bioperl-l mailing list