[Biopython] History, Efetch, and returned records limits

Jukka-Pekka Verta jp.verta at gmail.com
Mon Apr 16 15:23:25 EDT 2012


Hello fellow BioPythoneers,

I stumbled upon the same problem as Mariam (without having read your previous correspondence) while trying to fetch all Picea sitchensis nucleotide records. Following Peter's code (epost+efetch), I still had the problem of the fetch breaking off (after about 7000 sequences). The problem was fixed by following Peter's idea of simply retrying the failed fetch with try/except.

A collective thank you!

JP

from Bio import Entrez

def fetchFasta(species, out_file):  # script by Peter Cock, with retry enhancement
    Entrez.email = "jp.verta at gmail.com"
    # Get the full list of GI numbers up front (no history session for the search)
    search_handle = Entrez.esearch(db="nuccore", term=species+"[orgn]", retmax="20000")
    search_results = Entrez.read(search_handle)
    search_handle.close()
    gi_list = search_results["IdList"]
    count = int(search_results["Count"])
    print count
    assert count == len(gi_list), len(gi_list)
    out_handle = open(out_file, "a")
    batch_size = 1000
    for start in range(0, count, batch_size):
        end = min(count, start+batch_size)
        batch = gi_list[start:end]
        print "Going to download record %i to %i using epost+efetch" % (start+1, end)
        # Post this batch of GI numbers, then fetch them back via the history server
        post_results = Entrez.read(Entrez.epost("nuccore", id=",".join(batch)))
        webenv = post_results["WebEnv"]
        query_key = post_results["QueryKey"]
        fetch_handle = Entrez.efetch(db="nuccore", rettype="fasta", retmode="text",
                                     webenv=webenv, query_key=query_key)
        data = fetch_handle.read()
        fetch_handle.close()
        try:
            # FASTA output must start with ">"; anything else is an NCBI error page
            assert data.startswith(">"), data
        except AssertionError:
            # Retry the failed batch once before giving up
            fetch_handle = Entrez.efetch(db="nuccore", rettype="fasta", retmode="text",
                                         webenv=webenv, query_key=query_key)
            data = fetch_handle.read()
            fetch_handle.close()
            assert data.startswith(">"), data
        out_handle.write(data)
    print "Done"
    out_handle.close()
   

$ ./FetchFastaWithSpeciesName.py "Picea sitchensis" sitkaSequences.fa
19997
Going to download record 1 to 1000 using epost+efetch
Going to download record 1001 to 2000 using epost+efetch
Going to download record 2001 to 3000 using epost+efetch
Going to download record 3001 to 4000 using epost+efetch
Going to download record 4001 to 5000 using epost+efetch
Going to download record 5001 to 6000 using epost+efetch
Going to download record 6001 to 7000 using epost+efetch
Going to download record 7001 to 8000 using epost+efetch
Going to download record 8001 to 9000 using epost+efetch
Going to download record 9001 to 10000 using epost+efetch
Going to download record 10001 to 11000 using epost+efetch
Going to download record 11001 to 12000 using epost+efetch
Going to download record 12001 to 13000 using epost+efetch
Going to download record 13001 to 14000 using epost+efetch
Going to download record 14001 to 15000 using epost+efetch
Going to download record 15001 to 16000 using epost+efetch
Going to download record 16001 to 17000 using epost+efetch
Going to download record 17001 to 18000 using epost+efetch
Going to download record 18001 to 19000 using epost+efetch
Going to download record 19001 to 19997 using epost+efetch
Done
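For anyone wanting more than the single retry in the script above, the same idea can be generalized into a small helper. This is only a sketch: the name `fetch_with_retry` and its defaults are mine, not part of the original script; the `fetch` argument would be a zero-argument callable wrapping the `Entrez.efetch` call for one batch.

```python
import time

def fetch_with_retry(fetch, max_tries=3, delay=1.0):
    """Call fetch() until its result looks like FASTA, retrying on bad data.

    fetch: zero-argument callable returning the raw efetch text.
    max_tries, delay: illustrative defaults, not values from the thread.
    """
    data = ""
    for attempt in range(1, max_tries + 1):
        data = fetch()
        if data.startswith(">"):  # FASTA records always begin with '>'
            return data
        if attempt < max_tries:
            time.sleep(delay)  # brief pause before asking NCBI again
    raise RuntimeError("efetch never returned FASTA data: %r" % data[:100])
```

In the script's loop this would replace the try/except block, e.g. `data = fetch_with_retry(lambda: Entrez.efetch(...).read())`.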


On 2012-04-14, at 1:39 PM, Peter Cock wrote:
> 
> This is how I believe the NCBI expect this task to be done.
> In this specific case it seems to be an NCBI failure.
> Perhaps a loop to retry the efetch two or three times might
> work? It could be the whole history session breaks at the
> NCBI end though...
> 
> A somewhat brute force approach would be to do the
> search (don't bother with the history) and get the 10313
> GI numbers. Then use epost+efetch to grab the records
> in batches of say 1000.
> 
> Peter
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
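The batching Peter describes (and the script above implements inline) is plain list slicing; a minimal sketch, with a helper name (`batches`) of my own choosing:

```python
def batches(id_list, size):
    """Yield successive slices of at most `size` IDs from id_list."""
    for start in range(0, len(id_list), size):
        yield id_list[start:start + size]

# e.g. 10313 GI numbers in batches of 1000 give ten full batches plus one of 313
```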



