[Biopython] History, Efetch, and returned records limits
Jukka-Pekka Verta
jp.verta at gmail.com
Mon Apr 16 15:23:25 EDT 2012
Hello fellow BioPythoneers,
I stumbled upon the same problem as Mariam (without having read your previous correspondence) while trying to fetch all Picea sitchensis nucleotide records. Following Peter's code (epost+efetch), the fetch still broke off partway through (after about 7000 sequences). The problem was fixed by following Peter's idea of simply retrying the failed efetch with try/except.
A collective thank you!
JP
def fetchFasta(species, out_file):  # script by Peter Cock with enhancement
    from Bio import Entrez
    from Bio import SeqIO
    Entrez.email = "jp.verta@gmail.com"
    # Search for all nucleotide GI numbers for the species
    search_handle = Entrez.esearch(db="nuccore", term=species + "[orgn]", retmax="20000")
    search_results = Entrez.read(search_handle)
    search_handle.close()
    gi_list = search_results["IdList"]
    count = int(search_results["Count"])
    print count
    assert count == len(gi_list), len(gi_list)
    out_handle = open(out_file, "a")
    batch_size = 1000
    for start in range(0, count, batch_size):
        end = min(count, start + batch_size)
        batch = gi_list[start:end]
        print "Going to download record %i to %i using epost+efetch" % (start + 1, end)
        # Post this batch of GI numbers, then fetch them via the history session
        post_results = Entrez.read(Entrez.epost("nuccore", id=",".join(batch)))
        webenv = post_results["WebEnv"]
        query_key = post_results["QueryKey"]
        fetch_handle = Entrez.efetch(db="nuccore", rettype="fasta", retmode="text",
                                     webenv=webenv, query_key=query_key)
        data = fetch_handle.read()
        try:
            assert data.startswith(">"), data
            fetch_handle.close()
            out_handle.write(data)
        except AssertionError:
            # NCBI returned something other than FASTA; retry the efetch once
            fetch_handle = Entrez.efetch(db="nuccore", rettype="fasta", retmode="text",
                                         webenv=webenv, query_key=query_key)
            data = fetch_handle.read()
            assert data.startswith(">"), data
            fetch_handle.close()
            out_handle.write(data)
    print "Done"
    out_handle.close()
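The transcript below runs this as a stand-alone script with a species name and an output filename. The command-line handling isn't part of the code above; a minimal wrapper (just a sketch, assuming the function lives in FetchFastaWithSpeciesName.py) could be as simple as:

if __name__ == "__main__":
    # Sketch only: species name and output filename from the command line
    import sys
    fetchFasta(sys.argv[1], sys.argv[2])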
$ ./FetchFastaWithSpeciesName.py "Picea sitchensis" sitkaSequences.fa
19997
Going to download record 1 to 1000 using epost+efetch
Going to download record 1001 to 2000 using epost+efetch
Going to download record 2001 to 3000 using epost+efetch
Going to download record 3001 to 4000 using epost+efetch
Going to download record 4001 to 5000 using epost+efetch
Going to download record 5001 to 6000 using epost+efetch
Going to download record 6001 to 7000 using epost+efetch
Going to download record 7001 to 8000 using epost+efetch
Going to download record 8001 to 9000 using epost+efetch
Going to download record 9001 to 10000 using epost+efetch
Going to download record 10001 to 11000 using epost+efetch
Going to download record 11001 to 12000 using epost+efetch
Going to download record 12001 to 13000 using epost+efetch
Going to download record 13001 to 14000 using epost+efetch
Going to download record 14001 to 15000 using epost+efetch
Going to download record 15001 to 16000 using epost+efetch
Going to download record 16001 to 17000 using epost+efetch
Going to download record 17001 to 18000 using epost+efetch
Going to download record 18001 to 19000 using epost+efetch
Going to download record 19001 to 19997 using epost+efetch
Done
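If the breakups happen more often, Peter's suggestion (quoted below) of retrying the efetch two or three times could replace the single try/except above. A rough sketch of that idea (untested; the helper name, attempt count, and delay are my own choices):

import time
from Bio import Entrez

def fetch_with_retries(webenv, query_key, attempts=3):
    # Bounded retry loop around efetch: retry when the returned data
    # doesn't look like FASTA, pausing briefly between attempts.
    for attempt in range(attempts):
        fetch_handle = Entrez.efetch(db="nuccore", rettype="fasta",
                                     retmode="text", webenv=webenv,
                                     query_key=query_key)
        data = fetch_handle.read()
        fetch_handle.close()
        if data.startswith(">"):
            return data
        time.sleep(5)  # arbitrary pause before retrying
    raise RuntimeError("efetch failed %i times in a row" % attempts)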
On 2012-04-14, at 1:39 PM, Peter Cock wrote:
>
> This is how I believe the NCBI expect this task to be done.
> In this specific case it seems to be an NCBI failure.
> Perhaps a loop to retry the efetch two or three times might
> work? It could be the whole history session breaks at the
> NCBI end though...
>
> A somewhat brute force approach would be to do the
> search (don't bother with the history) and get the 10313
> GI numbers. Then use epost+efetch to grab the records
> in batches of say 1000.
>
> Peter