[Bioperl-l] Reliability of get_Seq_by_gi

Wes Barris wes.barris at csiro.au
Mon Apr 5 18:44:12 EDT 2004


Rob Edwards wrote:

> Just a suggestion, but have you tried making sure that $seq exists 
> before writing. Something like this should fix the problem:
> 
>       }
>    if ($seq) {$seq_out->write_seq($seq)}
>    else {print STDERR "Couldn't retrieve a sequence for $accOrGi\n"}
>    $got++;
>    sleep $wtime unless ($wtime == 0);
>    }
> 
> I don't think $seq exists if the server times out.
> 
> You should capture all the failed sequences and then validate them in 
> case there is some error with the Acc/GI (like they doesn't exist any 
> more).
> 
> Rob

Hi Rob,

Your solution appears to have worked.  I was able to download sequences
over night and my script was still running this morning -- the first time
it has ever done that.  Thanks.

> 
> 
> 
> On Apr 4, 2004, at 5:22 PM, Wes Barris wrote:
> 
>> Heikki Lehvaslaiho wrote:
>>
>>> Wes,
>>> See:
>>>     http://bioperl.org/pipermail/bioperl-l/2004-March/015423.html
>>> Yours,
>>>     -Heikki
>>
>>
>> Yes, but as I pointed out in my message, there are some problems in
>> doing that.
>>
>>> On Thursday 01 Apr 2004 22:02, Wes Barris wrote:
>>>
>>>> Hi,
>>>>
>>>> I have been struggling for a long time with a bioperl script that
>>>> retrieves a series of sequences in genbank format.  The problem
>>>> is that there is the occasional error in downloading a sequence
>>>> which results in a fatal error in my script.  I need to be able
>>>> to gracefully recover from a downloading error such that my
>>>> script continues to run.
>>>>
>>>> Here is my script:
>>>>
>>>> #!/usr/local/bin/perl -w
>>>> #
>>>> #
>>>> use Bio::DB::GenBank;
>>>> my $usage = "Usage: $0 <gi|accession|infile> <out.gb|out.fa>\n";
>>>> my $accOrGi = shift or die $usage;
>>>> my $outfile = shift or die $usage;
>>>> my $wtime = 73;
>>>> #
>>>> # Prepare input.
>>>> #
>>>> my @accOrGiList = ();   # empty the list
>>>> if (-r $accOrGi) {
>>>>    open(IN, $accOrGi);
>>>>    while (<IN>) {
>>>>       chomp;
>>>>       my @junk = split(/\s+/);
>>>>       push(@accOrGiList, $junk[0]);
>>>>       }
>>>>    close(IN);
>>>>    }
>>>> else {
>>>>    push(@accOrGiList, $accOrGi);
>>>>    }
>>>> #
>>>> # Prepare output.
>>>> #
>>>> my $seq_out;
>>>> if ($outfile =~ /\.gb$/) {
>>>>    $seq_out = Bio::SeqIO->new( -format=>'genbank', -file=>">$outfile");
>>>>    }
>>>> elsif ($outfile =~ /\.fa$/) {
>>>>    $seq_out = Bio::SeqIO->new( -format=>'fasta', -file=>">$outfile");
>>>>    }
>>>> else {
>>>>    die "Unrecognized output file extension. Use either .gb or .fa\n";
>>>>    }
>>>> #
>>>> # Get the entries.
>>>> #
>>>> my $gb = Bio::DB::GenBank->new();
>>>> my $seq;
>>>> my $got = 0;
>>>> foreach $accOrGi (@accOrGiList) {
>>>>    if ($accOrGi =~ /^\d+/) {
>>>>       print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
>>>>       $seq = $gb->get_Seq_by_gi($accOrGi);
>>>>       }
>>>>    else {
>>>>       print("Getting accession $accOrGi ($got/$#accOrGiList)...\n");
>>>>       $seq = $gb->get_Seq_by_acc($accOrGi);
>>>>       }
>>>>    $seq_out->write_seq($seq);
>>>>    $got++;
>>>>    sleep $wtime unless ($wtime == 0);
>>>>    }
>>>>
>>>> Using the above script, at least once a day, I get this error:
>>>>
>>>> Getting gi 45498451 (747/41132)...
>>>> Getting gi 45498450 (748/41132)...
>>>> Getting gi 45498449 (749/41132)...
>>>>
>>>> ------------- EXCEPTION  -------------
>>>> MSG: WebDBSeqI Request Error:
>>>> 500 (Internal Server Error) read timeout
>>>> Client-Date: Thu, 01 Apr 2004 13:54:45 GMT
>>>>
>>>>
>>>>
>>>> STACK Bio::DB::WebDBSeqI::_stream_request
>>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:728
>>>> STACK Bio::DB::WebDBSeqI::get_seq_stream
>>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:460
>>>> STACK Bio::DB::WebDBSeqI::get_Stream_by_gi
>>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:330
>>>> STACK Bio::DB::WebDBSeqI::get_Seq_by_gi
>>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:210
>>>> STACK toplevel /home/wes/proj/genbank/getGenbank.pl:50
>>>>
>>>> --------------------------------------
>>>>
>>>> -------------------- WARNING ---------------------
>>>> MSG: gi (45498449) does not exist
>>>> ---------------------------------------------------
>>>>
>>>> ------------- EXCEPTION  -------------
>>>> MSG: Attempting to write with no seq!
>>>> STACK Bio::SeqIO::genbank::write_seq
>>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/SeqIO/genbank.pm:591
>>>> STACK toplevel /home/wes/proj/genbank/getGenbank.pl:77
>>>>
>>>> --------------------------------------
>>>>
>>>> and then the script dies.  I have tried wrapping the "get" part
>>>> inside an eval like this:
>>>>
>>>> foreach $accOrGi (@accOrGiList) {
>>>>    eval {
>>>>       if ($accOrGi =~ /^\d+/) {
>>>>          print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
>>>>          $seq = $gb->get_Seq_by_gi($accOrGi);
>>>>          }
>>>>       else {
>>>>          print("Getting accession $accOrGi ($got/$#accOrGiList)...\n");
>>>>          $seq = $gb->get_Seq_by_acc($accOrGi);
>>>>          }
>>>>       };
>>>>    if ($@) {
>>>>       print("$accOrGi failed.\n");
>>>>       }
>>>>    .
>>>>    .
>>>>    .
>>>>
>>>> This prevents the script from dieing, but it causes several other 
>>>> problems:
>>>>    - it spawns a second instance of my script for each error
>>>>    - STDOUT stops being echoed to the terminal (I no longer see any
>>>> printed output) - it starts getting duplicate sequences
>>>>
>>>> Is there a reliable way to retrieve sequeces from NCBI?
>>>>
>>>> I'm using bioperl-1.4.
>>
>>
>>
>> -- 
>> Wes Barris
>> E-Mail: Wes.Barris at csiro.au
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at portal.open-bio.org
>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
> 
> 


-- 
Wes Barris
E-Mail: Wes.Barris at csiro.au


More information about the Bioperl-l mailing list