[Bioperl-l] Reliability of get_Seq_by_gi

Thu Apr 1 17:02:36 EST 2004

Hi,

I have been struggling for a long time with a bioperl script that
retrieves a series of sequences in genbank format.  The problem
is that there is the occasional error in downloading a sequence
which results in a fatal error in my script.  I need to be able
to gracefully recover from a downloading error such that my
script continues to run.

Here is my script:

#!/usr/local/bin/perl -w
#
#
use Bio::DB::GenBank;
my $usage = "Usage: $0 <gi|accession|infile> <out.gb|out.fa>\n";
my $accOrGi = shift or die $usage;
my $outfile = shift or die $usage;
my $wtime = 73;
#
# Prepare input.
#
my @accOrGiList = ();   # empty the list
if (-r $accOrGi) {
    open(IN, $accOrGi);
    while (<IN>) {
       chomp;
       my @junk = split(/\s+/);
       push(@accOrGiList, $junk[0]);
       }
    close(IN);
    }
else {
    push(@accOrGiList, $accOrGi);
    }
#
# Prepare output.
#
my $seq_out;
if ($outfile =~ /\.gb$/) {
    $seq_out = Bio::SeqIO->new( -format=>'genbank', -file=>">$outfile");
    }
elsif ($outfile =~ /\.fa$/) {
    $seq_out = Bio::SeqIO->new( -format=>'fasta', -file=>">$outfile");
    }
else {
    die "Unrecognized output file extension. Use either .gb or .fa\n";
    }
#
# Get the entries.
#
my $gb = Bio::DB::GenBank->new();
my $seq;
my $got = 0;
foreach $accOrGi (@accOrGiList) {
    if ($accOrGi =~ /^\d+/) {
       print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
       $seq = $gb->get_Seq_by_gi($accOrGi);
       }
    else {
       print("Getting accession $accOrGi ($got/$#accOrGiList)...\n");
       $seq = $gb->get_Seq_by_acc($accOrGi);
       }
    $seq_out->write_seq($seq);
    $got++;
    sleep $wtime unless ($wtime == 0);
    }

Using the above script, at least once a day, I get this error:

Getting gi 45498451 (747/41132)...
Getting gi 45498450 (748/41132)...
Getting gi 45498449 (749/41132)...

------------- EXCEPTION  -------------
MSG: WebDBSeqI Request Error:
500 (Internal Server Error) read timeout
Client-Date: Thu, 01 Apr 2004 13:54:45 GMT

STACK Bio::DB::WebDBSeqI::_stream_request 
/usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:728
STACK Bio::DB::WebDBSeqI::get_seq_stream 
/usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:460
STACK Bio::DB::WebDBSeqI::get_Stream_by_gi 
/usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:330
STACK Bio::DB::WebDBSeqI::get_Seq_by_gi 
/usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:210
STACK toplevel /home/wes/proj/genbank/getGenbank.pl:50

--------------------------------------

-------------------- WARNING ---------------------
MSG: gi (45498449) does not exist
---------------------------------------------------

------------- EXCEPTION  -------------
MSG: Attempting to write with no seq!
STACK Bio::SeqIO::genbank::write_seq 
/usr/local/lib/perl5/site_perl/5.6.1/Bio/SeqIO/genbank.pm:591
STACK toplevel /home/wes/proj/genbank/getGenbank.pl:77

--------------------------------------

and then the script dies.  I have tried wrapping the "get" part
inside an eval like this:

foreach $accOrGi (@accOrGiList) {
    eval {
       if ($accOrGi =~ /^\d+/) {
          print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
          $seq = $gb->get_Seq_by_gi($accOrGi);
          }
       else {
          print("Getting accession $accOrGi ($got/$#accOrGiList)...\n");
          $seq = $gb->get_Seq_by_acc($accOrGi);
          }
       };
    if ($@) {
       print("$accOrGi failed.\n");
       }
    .
    .
    .

This prevents the script from dieing, but it causes several other problems:
    - it spawns a second instance of my script for each error
    - STDOUT stops being echoed to the terminal (I no longer see any printed output)
    - it starts getting duplicate sequences

Is there a reliable way to retrieve sequeces from NCBI?

I'm using bioperl-1.4.
-- 
Wes Barris
E-Mail: Wes.Barris at csiro.au