[Bioperl-l] get_Stream_by_query Terminates Prematurely

Chris Fields cjfields at illinois.edu
Mon May 10 16:58:07 UTC 2010


500,000 sequences is far too many to request, even in a loop. Under most circumstances this breaks NCBI's E-utilities usage policies:

http://eutils.ncbi.nlm.nih.gov/#UserSystemRequirements

so don't be too surprised this is failing (this would be around 1000 queries of 500 sequences each).

You could try pulling down the raw sequences via Batch Entrez or using Bio::DB::EUtilities (which should die if an error occurs).
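
For the Bio::DB::EUtilities route, here is a rough sketch of what I mean, not a tested recipe. It runs one esearch with the history server turned on, then pulls the records down with efetch in windows of 500, retrying a window if the request dies. The sub names (batch_windows, fetch_all), the output file name, the batch size, the retry count, and the sleep intervals are all placeholders I made up for illustration; check the pauses and request sizes against the current NCBI guidelines before running anything this large.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Split $count records into [retstart, retmax] windows of at most $batch_size.
# (Hypothetical helper, not part of BioPerl.)
sub batch_windows {
    my ($count, $batch_size) = @_;
    die "batch size must be positive\n" unless $batch_size > 0;
    my @windows;
    for (my $start = 0; $start < $count; $start += $batch_size) {
        my $left = $count - $start;
        push @windows, [ $start, $left < $batch_size ? $left : $batch_size ];
    }
    return @windows;
}

# Search once, then efetch the hits window by window into a local file.
# (Hypothetical wrapper; argument names are assumptions for this sketch.)
sub fetch_all {
    my (%arg) = @_;
    require Bio::DB::EUtilities;    # loaded lazily so batch_windows() works without BioPerl

    my $factory = Bio::DB::EUtilities->new(
        -eutil      => 'esearch',
        -db         => 'nucleotide',
        -term       => $arg{term},
        -email      => $arg{email},
        -usehistory => 'y',         # keep the hit list on NCBI's history server
    );
    my $count = $factory->get_count;
    my $hist  = $factory->next_History or die "No history data returned\n";

    # Reuse the same factory for efetch against the stored hit list.
    $factory->set_parameters(-eutil => 'efetch', -rettype => 'gb', -history => $hist);

    open my $out, '>', $arg{file} or die "Can't open $arg{file}: $!\n";
    for my $window (batch_windows($count, $arg{batch_size})) {
        my ($retstart, $retmax) = @$window;
        $factory->set_parameters(-retstart => $retstart, -retmax => $retmax);

        my $attempt = 0;
        while (1) {
            # Buffer the window so a failed attempt never leaves partial records on disk.
            my $buffer = '';
            my $ok = eval {
                $factory->get_Response(-cb => sub { $buffer .= $_[0] });
                1;
            };
            if ($ok) { print {$out} $buffer; last; }
            die "Giving up at record $retstart: $@" if ++$attempt >= $arg{max_retries};
            warn "Batch at $retstart failed (attempt $attempt), retrying: $@";
            sleep 5;                # assumed back-off before retrying
        }
        sleep 1;                    # assumed pause between windows
    }
    close $out or die "Can't close $arg{file}: $!\n";
    return $count;
}

# Only touch the network when called as: perl fetch.pl 'Pongo[ORGN]' you@example.org
if (@ARGV) {
    my ($term, $email) = @ARGV;
    fetch_all(term => $term, email => $email, file => 'pongo.gb',
              batch_size => 500, max_retries => 5);
}
```

Once the file is on disk you can parse it locally with Bio::SeqIO and pull out species, length, and accession without holding a connection open, so a dropped stream only costs you one window instead of the whole run.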

chris

On May 9, 2010, at 9:22 PM, bergeycm wrote:

> 
> Hi all,
> 
> I'm attempting to query GenBank for all sequences' lengths for a given
> taxon. I'm using get_Stream_by_query(), but only to grab the species,
> length, and accession. The genus of interest has almost 500,000 GB entries,
> though, and my code hangs up at odd points in the info-gathering loop.
> (Often after only 300 or 400 iterations.) The problem is that
> $stream_obj->next_seq (of Bio::SeqIO::genbank) eventually comes back
> undefined.
> 
> I've tried wrapping the next_seq portion of the code in an eval block, but
> to no avail. Is there a way to split a query into a bunch of small streams
> that aren't too much to ask? Or is there a way to pick up a dropped SeqIO
> stream? I think the connection is timing out and the stream is being lost.
> Any advice is greatly appreciated, as I'm fairly new to BioPerl.
> 
> - bergeycm
> 
> 
> 
> use Bio::DB::GenBank;
> use Bio::DB::Query::GenBank;
> 
> 
> # Get general things ready to go for querying GenBank
> my %options;
> $options{'-maxids'} = '500000';		# There are presently 460,184 sequences
> $options{'-db'} = 'nucleotide';
> $options{'-query'} = "Pongo [ORGN]";	# Orangutans
> 
> 
> my $query_obj = Bio::DB::Query::GenBank->new(%options);	
> my $total = $query_obj->count;
> 
> my $gb_obj = Bio::DB::GenBank->new();
> my $stream_obj = $gb_obj->get_Stream_by_query($query_obj);
> 
> # Restrict info to just what I'll be using. No sequence necessary.
> my $builder = $stream_obj->sequence_builder();
> $builder->want_none();
> $builder->add_wanted_slot('species','length','accession');
> 
> my $c = 0;
> 
> for (1 .. $total) {
> 	eval {
> 		my $seq_obj =  $stream_obj->next_seq;
> 		my $flavor = $seq_obj->species;			
> 		print $c, "\t", $flavor->scientific_name, " (", $flavor->id, ")\t",
> 			$seq_obj->length, "\t", $seq_obj->accession, "\n";
> 	};
> 
> 	if ($@) {
> 		print $@, "\n";	# eval errors are in $@, not $!; "\n" needs double quotes
> 	}
> 	
> 	# Pause for a little over a third of a second
> 	select(undef, undef, undef, 0.35);
> 	
> 	$c++;
> }
> 
> 
> 




