[Bioperl-l] Bio::DB::GenBank and large number of requests

Alexander Kozik akozik at atgc.org
Tue Jan 29 18:32:41 UTC 2008


You don't need BioPerl to accomplish this task of downloading
several thousand sequences from an accession ID list.

NCBI Batch Entrez can do that:
http://www.ncbi.nlm.nih.gov/sites/batchentrez

Just submit a large list of IDs, select the database, and download.

You can usually submit ~50,000 IDs in one file without problems.
It may not return results if the list is larger than ~100,000 IDs.
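
For a list the size of the one in this thread, one option is to pre-split
the accession file into chunks below that limit before uploading. A rough,
untested sketch (the chunk size and output file names are just placeholders):

use strict;
use warnings;

# Split a large accession list into chunk files small enough for Batch Entrez.
my $chunk_size = 50_000;
my $infile     = shift @ARGV or die "Usage: $0 accession_list.txt\n";

open my $in, '<', $infile or die "Can't open $infile: $!";
my ($chunk, $count) = (0, 0);
my $out;
while (my $acc = <$in>) {
    next unless $acc =~ /\S/;                 # skip blank lines
    if ($count % $chunk_size == 0) {          # start a new chunk file
        close $out if $out;
        $chunk++;
        open $out, '>', "accessions_chunk_$chunk.txt"
            or die "Can't write chunk $chunk: $!";
    }
    print {$out} $acc;
    $count++;
}
close $out if $out;
close $in;
print "Wrote $count accessions into $chunk file(s)\n";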

--
Alexander Kozik
Bioinformatics Specialist
Genome and Biomedical Sciences Facility
451 Health Sciences Drive
Genome Center, 4-th floor, room 4302
University of California
Davis, CA 95616-8816
Phone: (530) 754-9127
email#1: akozik at atgc.org
email#2: akozik at gmail.com
web: http://www.atgc.org/



Chris Fields wrote:
> Yes, you can only retrieve ~500 sequences at a time using either 
> Bio::DB::EUtilities or Bio::DB::GenBank.  Both modules interact with 
> NCBI's EUtilities (the former returns raw data from the URL to be 
> processed later, the latter returns Bio::Seq/Bio::SeqIO objects).
> 
> http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=coursework.section.large-datasets 
> 
> 
> You can usually post more IDs using epost and then fetch the sequences 
> by referring to the returned WebEnv/query_key combination (batch 
> posting).  I try to make this a bit easier with EUtilities, but it is 
> woefully lacking in documentation (my fault); there is some code up on 
> the wiki which should work.
> 
> chris
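
For reference, an epost-then-efetch loop along those lines could look
roughly like the sketch below. This is not the wiki code itself; it is a
minimal sketch against the documented Bio::DB::EUtilities interface, and
the e-mail address, ID list, output file, and batch size are placeholders.

use strict;
use warnings;
use Bio::DB::EUtilities;

# Placeholder accession list; in practice read these from a file.
my @ids = qw(NM_001173 NM_002745 NM_003467);

# Post the IDs once; the server-side history carries the WebEnv/query_key.
my $factory = Bio::DB::EUtilities->new(
    -eutil => 'epost',
    -db    => 'nucleotide',
    -id    => \@ids,
    -email => 'me@example.org',    # NCBI asks for a contact address
);
my $hist = $factory->next_History or die 'No history data returned';

# Reuse the same factory for efetch against that history.
$factory->set_parameters(
    -eutil   => 'efetch',
    -rettype => 'gb',
    -history => $hist,
);

my ($retstart, $retmax) = (0, 500);
open my $out, '>', 'seqs.gb' or die "Can't open output: $!";
while ($retstart < scalar @ids) {
    $factory->set_parameters(-retstart => $retstart, -retmax => $retmax);
    eval {
        $factory->get_Response(-cb => sub { print {$out} shift });
    };
    die "Server error: $@" if $@;
    $retstart += $retmax;
    sleep 3;    # pause between requests to stay polite
}
close $out;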
> 
> On Jan 29, 2008, at 11:19 AM, Tristan Lefebure wrote:
> 
>> Hello,
>>
>> I would like to download a large number of sequences from GenBank 
>> (122,146 to be exact) from a list of accession numbers.
>> I first looked into Bio::DB::EUtilities, but got lost and ended up 
>> using Bio::DB::GenBank.
>> My script works well for short requests, but it gives the following 
>> error with this long one:
>>
>> ------------- EXCEPTION: Bio::Root::Exception -------------
>> MSG: WebDBSeqI Request Error:
>> 500 short write
>> Content-Type: text/plain
>> Client-Date: Tue, 29 Jan 2008 17:22:46 GMT
>> Client-Warning: Internal response
>>
>> 500 short write
>>
>> STACK: Error::throw
>> STACK: Bio::Root::Root::throw 
>> /usr/local/share/perl/5.8.8/Bio/Root/Root.pm:359
>> STACK: Bio::DB::WebDBSeqI::_request 
>> /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:685
>> STACK: Bio::DB::WebDBSeqI::get_seq_stream 
>> /usr/local/share/perl/5.8.8/Bio/DB/WebDBSeqI.pm:472
>> STACK: Bio::DB::NCBIHelper::get_Stream_by_acc 
>> /usr/local/share/perl/5.8.8/Bio/DB/NCBIHelper.pm:361
>> STACK: ./fetch_from_genbank.pl:58
>> ---------------------------------------------------------
>>
>> Does that mean that we can only fetch 500 sequences at a time?
>> Should I split my list into 500-id fragments and submit them one 
>> after the other?
>>
>> Any suggestions are very welcome...
>> Thanks,
>> -Tristan
>>
>>
>> Here is the script:
>>
>> ##################################
>> use strict;
>> use warnings;
>> use Bio::DB::GenBank;
>> # use Bio::DB::EUtilities;
>> use Bio::SeqIO;
>> use Getopt::Long;
>>
>> # 2008-01-22 T Lefebure
>> # I tried to use Bio::DB::EUtilities without much success and went back
>> # to Bio::DB::GenBank.
>> # The following procedure is not ideal, as the stream is first copied
>> # to a temporary file and then re-used by BioPerl to generate the
>> # final file.
>>
>> my $db = 'nucleotide';
>> my $format = 'genbank';
>> my $help= '';
>> my $dformat = 'gb';
>>
>> GetOptions(
>>     'help|?' => \$help,
>>     'format=s'  => \$format,
>>     'database=s'    => \$db,
>> );
>>
>>
>> my $printhelp = "\nUsage: $0 [options] <list of ids or acc> <output>
>>
>> Will download the corresponding data from GenBank. BioPerl is required.
>>
>> Options:
>>     -h
>>         print this help
>>     -format: genbank|fasta|...
>>         give output format (default=genbank)
>>     -database: nucleotide|genome|protein|...
>>         define the database to search in (default=nucleotide)
>>
>> The full description of the options can be found at 
>> http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html\n";
>>
>> if ($help or $#ARGV < 1) {
>>     print $printhelp;
>>     exit;
>> }
>>
>> open my $list_fh, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!";
>> chomp(my @list = <$list_fh>);
>> close $list_fh;
>>
>> if ($format eq 'fasta') { $dformat = 'fasta' }
>>
>> my $gb = Bio::DB::GenBank->new( -retrievaltype => 'tempfile',
>>                                 -format        => $dformat,
>>                                 -db            => $db,
>>                               );
>> my $seqio = $gb->get_Stream_by_acc(\@list);
>>
>> my $seqout = Bio::SeqIO->new( -file => ">$ARGV[1]",
>>                 -format => $format,
>>             );
>> while (my $seqo = $seqio->next_seq ) {
>>     print $seqo->id, "\n";
>>     $seqout->write_seq($seqo);
>> }
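
One way to adapt the script above to the ~500-record limit mentioned
earlier in the thread is to slice the accession list and open a new
stream per slice. A rough, untested sketch (the chunk size follows the
limit quoted above; the pause between requests is a guess, not a
documented requirement):

use strict;
use warnings;
use Bio::DB::GenBank;
use Bio::SeqIO;

my ($listfile, $outfile) = @ARGV;
open my $list_fh, '<', $listfile or die "Can't open $listfile: $!";
chomp(my @list = <$list_fh>);
close $list_fh;

my $gb = Bio::DB::GenBank->new(
    -retrievaltype => 'tempfile',
    -format        => 'gb',
    -db            => 'nucleotide',
);
my $seqout = Bio::SeqIO->new(-file => ">$outfile", -format => 'genbank');

my $chunk_size = 500;                          # per the limit mentioned above
while (my @chunk = splice @list, 0, $chunk_size) {
    my $stream = $gb->get_Stream_by_acc(\@chunk);
    while (my $seq = $stream->next_seq) {
        print $seq->id, "\n";
        $seqout->write_seq($seq);
    }
    sleep 3;                                   # pause between requests (a guess)
}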
> 
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
> 
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


