[Bioperl-l] Reliability of get_Seq_by_gi

Sun Apr 4 18:45:56 EDT 2004

Just a suggestion, but have you tried making sure that $seq exists 
before writing. Something like this should fix the problem:

       }
    if ($seq) {$seq_out->write_seq($seq)}
    else {print STDERR "Couldn't retrieve a sequence for $accOrGi\n"}
    $got++;
    sleep $wtime unless ($wtime == 0);
    }

I don't think $seq exists if the server times out.

You should capture all the failed sequences and then validate them in 
case there is some error with the Acc/GI (like they doesn't exist any 
more).

Rob

On Apr 4, 2004, at 5:22 PM, Wes Barris wrote:

> Heikki Lehvaslaiho wrote:
>> Wes,
>> See:
>> 	http://bioperl.org/pipermail/bioperl-l/2004-March/015423.html
>> Yours,
>> 	-Heikki
>
> Yes, but as I pointed out in my message, there are some problems in
> doing that.
>
>> On Thursday 01 Apr 2004 22:02, Wes Barris wrote:
>>> Hi,
>>>
>>> I have been struggling for a long time with a bioperl script that
>>> retrieves a series of sequences in genbank format.  The problem
>>> is that there is the occasional error in downloading a sequence
>>> which results in a fatal error in my script.  I need to be able
>>> to gracefully recover from a downloading error such that my
>>> script continues to run.
>>>
>>> Here is my script:
>>>
>>> #!/usr/local/bin/perl -w
>>> #
>>> #
>>> use Bio::DB::GenBank;
>>> my $usage = "Usage: $0 <gi|accession|infile> <out.gb|out.fa>\n";
>>> my $accOrGi = shift or die $usage;
>>> my $outfile = shift or die $usage;
>>> my $wtime = 73;
>>> #
>>> # Prepare input.
>>> #
>>> my @accOrGiList = ();   # empty the list
>>> if (-r $accOrGi) {
>>>    open(IN, $accOrGi);
>>>    while (<IN>) {
>>>       chomp;
>>>       my @junk = split(/\s+/);
>>>       push(@accOrGiList, $junk[0]);
>>>       }
>>>    close(IN);
>>>    }
>>> else {
>>>    push(@accOrGiList, $accOrGi);
>>>    }
>>> #
>>> # Prepare output.
>>> #
>>> my $seq_out;
>>> if ($outfile =~ /\.gb$/) {
>>>    $seq_out = Bio::SeqIO->new( -format=>'genbank', 
>>> -file=>">$outfile");
>>>    }
>>> elsif ($outfile =~ /\.fa$/) {
>>>    $seq_out = Bio::SeqIO->new( -format=>'fasta', -file=>">$outfile");
>>>    }
>>> else {
>>>    die "Unrecognized output file extension. Use either .gb or .fa\n";
>>>    }
>>> #
>>> # Get the entries.
>>> #
>>> my $gb = Bio::DB::GenBank->new();
>>> my $seq;
>>> my $got = 0;
>>> foreach $accOrGi (@accOrGiList) {
>>>    if ($accOrGi =~ /^\d+/) {
>>>       print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
>>>       $seq = $gb->get_Seq_by_gi($accOrGi);
>>>       }
>>>    else {
>>>       print("Getting accession $accOrGi ($got/$#accOrGiList)...\n");
>>>       $seq = $gb->get_Seq_by_acc($accOrGi);
>>>       }
>>>    $seq_out->write_seq($seq);
>>>    $got++;
>>>    sleep $wtime unless ($wtime == 0);
>>>    }
>>>
>>> Using the above script, at least once a day, I get this error:
>>>
>>> Getting gi 45498451 (747/41132)...
>>> Getting gi 45498450 (748/41132)...
>>> Getting gi 45498449 (749/41132)...
>>>
>>> ------------- EXCEPTION  -------------
>>> MSG: WebDBSeqI Request Error:
>>> 500 (Internal Server Error) read timeout
>>> Client-Date: Thu, 01 Apr 2004 13:54:45 GMT
>>>
>>>
>>>
>>> STACK Bio::DB::WebDBSeqI::_stream_request
>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:728
>>> STACK Bio::DB::WebDBSeqI::get_seq_stream
>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:460
>>> STACK Bio::DB::WebDBSeqI::get_Stream_by_gi
>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:330
>>> STACK Bio::DB::WebDBSeqI::get_Seq_by_gi
>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:210
>>> STACK toplevel /home/wes/proj/genbank/getGenbank.pl:50
>>>
>>> --------------------------------------
>>>
>>> -------------------- WARNING ---------------------
>>> MSG: gi (45498449) does not exist
>>> ---------------------------------------------------
>>>
>>> ------------- EXCEPTION  -------------
>>> MSG: Attempting to write with no seq!
>>> STACK Bio::SeqIO::genbank::write_seq
>>> /usr/local/lib/perl5/site_perl/5.6.1/Bio/SeqIO/genbank.pm:591
>>> STACK toplevel /home/wes/proj/genbank/getGenbank.pl:77
>>>
>>> --------------------------------------
>>>
>>> and then the script dies.  I have tried wrapping the "get" part
>>> inside an eval like this:
>>>
>>> foreach $accOrGi (@accOrGiList) {
>>>    eval {
>>>       if ($accOrGi =~ /^\d+/) {
>>>          print("Getting gi $accOrGi ($got/$#accOrGiList)...\n");
>>>          $seq = $gb->get_Seq_by_gi($accOrGi);
>>>          }
>>>       else {
>>>          print("Getting accession $accOrGi 
>>> ($got/$#accOrGiList)...\n");
>>>          $seq = $gb->get_Seq_by_acc($accOrGi);
>>>          }
>>>       };
>>>    if ($@) {
>>>       print("$accOrGi failed.\n");
>>>       }
>>>    .
>>>    .
>>>    .
>>>
>>> This prevents the script from dieing, but it causes several other 
>>> problems:
>>>    - it spawns a second instance of my script for each error
>>>    - STDOUT stops being echoed to the terminal (I no longer see any
>>> printed output) - it starts getting duplicate sequences
>>>
>>> Is there a reliable way to retrieve sequeces from NCBI?
>>>
>>> I'm using bioperl-1.4.
>
>
> -- 
> Wes Barris
> E-Mail: Wes.Barris at csiro.au
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l