[Bioperl-l] Bio::DB::GenBank and large number of requests

Chris Fields cjfields at uiuc.edu
Wed Jan 30 15:10:14 UTC 2008


You can use an eval {} block to catch the error, then redo the loop
(so you don't iterate to the next block), or use next to skip the
current block when an error occurs.  If you use redo, you should also
keep a counter so you can exit the loop after several failed tries.
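
For example, something along these lines (an untested sketch; @blocks
is assumed to hold your accession list pre-split into array refs of
500 IDs each):

#########
use Bio::DB::GenBank;

my $gb = Bio::DB::GenBank->new(-retrievaltype => 'tempfile',
                               -format        => 'gb');
my $tries = 0;
BLOCK: foreach my $block (@blocks) {
    # get_Stream_by_acc() throws a Bio::Root::Exception on failure,
    # so trap it with eval rather than testing the return value
    my $seqio = eval { $gb->get_Stream_by_acc($block) };
    if ($@) {
        if (++$tries >= 5) {
            warn "Giving up on this block after $tries tries\n";
            $tries = 0;
            next BLOCK;    # skip this block and move on
        }
        redo BLOCK;        # retry the same block
    }
    $tries = 0;            # reset the counter for the next block
    # ... write the stream out with Bio::SeqIO as before ...
}
#########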

chris

On Jan 30, 2008, at 8:56 AM, Tristan Lefebure wrote:

> Thank you both!
>
> Just in case it might be useful for someone else, here are my
> ramblings:
>
> 1. I first tried to adapt my script to fetch 500 sequences at a
> time. It works, except that ~40% of the time NCBI returns the
> following error and my script crashes:
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: WebDBSeqI Request Error:
> [...]
>    The proxy server received an invalid
>    response from an upstream server.
> [...]
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/local/share/perl/5.8.8/Bio/Root/Root.pm:359
> STACK: Bio::DB::WebDBSeqI::_request /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:685
> STACK: Bio::DB::WebDBSeqI::get_seq_stream /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:472
> STACK: Bio::DB::NCBIHelper::get_Stream_by_acc /usr/local/share/perl/5.8.8/Bio/DB/NCBIHelper.pm:361
> STACK: ./fetch_from_genbank.pl:68
> -----------------------------------------------------------
>
> I tried to modify the script so that when the retrieval of a
> 500-sequence block crashes, it continues with the other blocks, but I
> was unsuccessful. It probably requires a better understanding of how
> BioPerl throws errors...
> Here is the section of the script that was modified:
> #########
> my $n_seq = scalar @list;
> my @aborted;
>
> for (my $i=1; $i<=$n_seq; $i += 500) {
> 	print "Fetching sequences $i to ", $i+499, ": ";
> 	my $start = $i - 1;
> 	my $end = $i + 498;			# 500 entries, not 501
> 	$end = $n_seq - 1 if $end > $n_seq - 1;	# don't slice past the list
> 	my @red_list = @list[$start .. $end];
> 	my $gb = Bio::DB::GenBank->new(	-retrievaltype => 'tempfile',
> 					-format => $dformat,
> 					-db => $db,
> 				);
>
> 	my $seqio;
> 	# NB: this check never fires -- get_Stream_by_acc() throws an
> 	# exception on failure rather than returning false
> 	unless(	$seqio = $gb->get_Stream_by_acc(\@red_list)) {
> 		print "Aborted, resubmit later\n";
> 		push @aborted, @red_list;
> 		next;
> 	}
>
> 	my $seqout = Bio::SeqIO->new( -file => ">$ARGV[1].$i",
> 					-format => $format,
> 				);
> 	while (my $seqo = $seqio->next_seq ) {
> # 		print $seqo->id, "\n";
> 		$seqout->write_seq($seqo);
> 	}
> 	print "Done\n";
> }
>
> if (@aborted) {
> 	open OUT, '>', 'aborted_fetching.AN' or die "Can't write aborted list: $!";
> 	foreach (@aborted) { print OUT "$_\n" }
> 	close OUT;
> }
> ##########
>
>
> 2. So I moved to the second solution and tried Batch Entrez. I cut my
> ~120,000-long AN list into 10,000-line pieces using split:
> split -l 10000 full_list.AN splitted_list_
>
> and then submitted the 13 lists one by one. I must say that I don't
> really like using a web interface to fetch data, and here the most
> annoying part is that you end up with a regular Entrez/GenBank web
> page: select your format, export to file, choose a file name... and
> you have to do it many times.
> It is too prone to human and web-browser errors for my taste, but it
> worked.
> Nevertheless, there are some caveats:
> - some downloaded files were incomplete (~10%) and had to be
> downloaded again
> - you can't submit several lists at the same time (otherwise the same
> cookie will be used and you'll end up with several identical files)
>
> -Tristan
>
> On Tuesday 29 January 2008 13:44:16 you wrote:
>> Forgot about that one; it's definitely a better way to do it if you
>> have the GI/accessions.
>>
>> chris
>>
>> On Jan 29, 2008, at 12:32 PM, Alexander Kozik wrote:
>>> You don't need to use BioPerl for this task, i.e. to download
>>> several thousand sequences based on an accession ID list.
>>>
>>> NCBI batch Entrez can do that:
>>> http://www.ncbi.nlm.nih.gov/sites/batchentrez
>>>
>>> Just submit a large list of IDs, select the database, and download.
>>>
>>> You can usually submit ~50,000 IDs in one file without problems;
>>> it may not return results if the list is larger than ~100,000 IDs.
>>>
>>> --
>>> Alexander Kozik
>>> Bioinformatics Specialist
>>> Genome and Biomedical Sciences Facility
>>> 451 Health Sciences Drive
>>> Genome Center, 4-th floor, room 4302
>>> University of California
>>> Davis, CA 95616-8816
>>> Phone: (530) 754-9127
>>> email#1: akozik at atgc.org
>>> email#2: akozik at gmail.com
>>> web: http://www.atgc.org/
>>>
>>> Chris Fields wrote:
>>>> Yes, you can only retrieve ~500 sequences at a time using
>>>> Bio::DB::GenBank.  Both Bio::DB::GenBank and Bio::DB::EUtilities
>>>> interact with NCBI's EUtilities (the former returns
>>>> Bio::Seq/Bio::SeqIO objects; the latter returns the raw data from
>>>> the URL, to be processed later).
>>>> http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=coursework.section.large-datasets
>>>> You can usually post more IDs using epost and then fetch the
>>>> sequences by referring to the WebEnv/query_key combo (batch
>>>> posting).  I try to make this a bit easier with EUtilities, but it
>>>> is woefully lacking in documentation (my fault); there is some code
>>>> up on the wiki which should work.
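>>>>
>>>> Roughly, the wiki code goes like this (a sketch only -- the
>>>> EUtilities interface is still in flux, so method and parameter
>>>> names may differ from what's on CPAN; @ids is your accession
>>>> list):
>>>>
>>>> use Bio::DB::EUtilities;
>>>>
>>>> # post all the IDs once; NCBI hands back a WebEnv/query_key history
>>>> my $factory = Bio::DB::EUtilities->new(-eutil => 'epost',
>>>>                                        -db    => 'nucleotide',
>>>>                                        -id    => \@ids);
>>>> my $hist = $factory->next_History or die 'No history data returned';
>>>>
>>>> # then fetch against that history in chunks of 500
>>>> $factory->set_parameters(-eutil   => 'efetch',
>>>>                          -rettype => 'gb',
>>>>                          -history => $hist);
>>>> my ($retmax, $retstart) = (500, 0);
>>>> open my $out, '>', 'seqs.gb' or die "Can't open file: $!";
>>>> while ($retstart < @ids) {
>>>>     $factory->set_parameters(-retmax   => $retmax,
>>>>                              -retstart => $retstart);
>>>>     eval {
>>>>         $factory->get_Response(-cb => sub { my ($data) = @_;
>>>>                                             print $out $data });
>>>>     };
>>>>     die "Server error: $@" if $@;   # or redo with a retry counter
>>>>     $retstart += $retmax;
>>>> }
>>>> close $out;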
>>>> chris
>>>>
>>>> On Jan 29, 2008, at 11:19 AM, Tristan Lefebure wrote:
>>>>> Hello,
>>>>>
>>>>> I would like to download a large number of sequences from GenBank
>>>>> (122,146 to be exact) from a list of accession numbers.
>>>>> I first looked into Bio::DB::EUtilities, but got lost and finally
>>>>> used Bio::DB::GenBank.
>>>>> My script works well for short requests, but it gives the
>>>>> following error with the long request:
>>>>>
>>>>> ------------- EXCEPTION: Bio::Root::Exception -------------
>>>>> MSG: WebDBSeqI Request Error:
>>>>> 500 short write
>>>>> Content-Type: text/plain
>>>>> Client-Date: Tue, 29 Jan 2008 17:22:46 GMT
>>>>> Client-Warning: Internal response
>>>>>
>>>>> 500 short write
>>>>>
>>>>> STACK: Error::throw
>>>>> STACK: Bio::Root::Root::throw /usr/local/share/perl/5.8.8/Bio/Root/Root.pm:359
>>>>> STACK: Bio::DB::WebDBSeqI::_request /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:685
>>>>> STACK: Bio::DB::WebDBSeqI::get_seq_stream /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:472
>>>>> STACK: Bio::DB::NCBIHelper::get_Stream_by_acc /usr/local/share/perl/5.8.8/Bio/DB/NCBIHelper.pm:361
>>>>> STACK: ./fetch_from_genbank.pl:58
>>>>> ---------------------------------------------------------
>>>>>
>>>>> Does that mean that we can only fetch 500 sequences at a time?
>>>>> Should I split my list into 500-id fragments and submit them one
>>>>> after the other?
>>>>>
>>>>> Any suggestions are very welcome...
>>>>> Thanks,
>>>>> -Tristan
>>>>>
>>>>>
>>>>> Here is the script:
>>>>>
>>>>> ##################################
>>>>> use strict;
>>>>> use warnings;
>>>>> use Bio::DB::GenBank;
>>>>> # use Bio::DB::EUtilities;
>>>>> use Bio::SeqIO;
>>>>> use Getopt::Long;
>>>>>
>>>>> # 2008-01-22 T Lefebure
>>>>> # I tried to use Bio::DB::EUtilities without much success and went
>>>>> # back to Bio::DB::GenBank.
>>>>> # The following procedure is not ideal, as the stream is first
>>>>> # copied to a temporary file and then re-used by BioPerl to
>>>>> # generate the final file.
>>>>>
>>>>> my $db = 'nucleotide';
>>>>> my $format = 'genbank';
>>>>> my $help= '';
>>>>> my $dformat = 'gb';
>>>>>
>>>>> GetOptions(
>>>>>   'help|?' => \$help,
>>>>>   'format=s'  => \$format,
>>>>>   'database=s'    => \$db,
>>>>> );
>>>>>
>>>>>
>>>>> my $printhelp = "\nUsage: $0 [options] <list of ids or acc> <output>
>>>>>
>>>>> Will download the corresponding data from GenBank. BioPerl is
>>>>> required.
>>>>>
>>>>> Options:
>>>>>   -h
>>>>>       print this help
>>>>>   -format: genbank|fasta|...
>>>>>       give output format (default=genbank)
>>>>>   -database: nucleotide|genome|protein|...
>>>>>       define the database to search in (default=nucleotide)
>>>>>
>>>>> The full description of the options can be found at
>>>>> http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html
>>>>> \n";
>>>>>
>>>>> if ($help or $#ARGV < 1) {
>>>>>   print $printhelp;
>>>>>   exit;
>>>>> }
>>>>>
>>>>> open LIST, $ARGV[0] or die "Can't open $ARGV[0]: $!";
>>>>> chomp(my @list = <LIST>);   # strip trailing newlines from the accessions
>>>>>
>>>>> if ($format eq 'fasta') { $dformat = 'fasta' }
>>>>>
>>>>> my $gb = Bio::DB::GenBank->new(    -retrievaltype => 'tempfile',
>>>>>               -format => $dformat,
>>>>>               -db => $db,
>>>>>           );
>>>>> my $seqio = $gb->get_Stream_by_acc(\@list);
>>>>>
>>>>> my $seqout = Bio::SeqIO->new( -file => ">$ARGV[1]",
>>>>>               -format => $format,
>>>>>           );
>>>>> while (my $seqo = $seqio->next_seq ) {
>>>>>   print $seqo->id, "\n";
>>>>>   $seqout->write_seq($seqo);
>>>>> }
>>>>
>>>> Christopher Fields
>>>> Postdoctoral Researcher
>>>> Lab of Dr. Robert Switzer
>>>> Dept of Biochemistry
>>>> University of Illinois Urbana-Champaign
>>
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign
>
>

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign





