[Biopython] Searching for and downloading sequences using the history
Carlos Javier Borroto
carlos.borroto at gmail.com
Fri Sep 18 18:08:14 UTC 2009
On Fri, Sep 18, 2009 at 1:15 PM, Carlos Javier Borroto
<carlos.borroto at gmail.com> wrote:
> On Fri, Sep 18, 2009 at 12:59 PM, Carlos Javier Borroto
> <carlos.borroto at gmail.com> wrote:
>> Hi all,
>>
>> I'm trying to download all of the ESTs from a species. I'm following the
>> example in the tutorial, which seems to be exactly what I need, but I'm
>> running into this problem:
>>
>
> Am I right? I'm going to implement this and share it here.
>
Well, here is my implementation. I'm very new to Biopython, and even to
Python, and my programming skills aren't great either, but since what I
did was mostly copy and paste from the tutorial, I feel confident
sharing this code. Any advice to make it better is highly welcome.
It seems to be working just fine, but I haven't been able to run it to
the end, because I keep getting these sporadic NCBI server errors:
Going to download records 5601 to 5700
Traceback (most recent call last):
  File "ncbi-downloader.py", line 58, in <module>
    webenv=webenv, query_key=query_key)
  File "/usr/local/lib/python2.6/dist-packages/Bio/Entrez/__init__.py", line 126, in efetch
    return _open(cgi, variables)
  File "/usr/local/lib/python2.6/dist-packages/Bio/Entrez/__init__.py", line 373, in _open
    raise IOError(data.strip())
IOError: Error: 156514070 is not available at this time.
Error: 156514069 is not available at this time.
Error: 156514068 is not available at this time.
Error: 156514067 is not available at this time.
But I guess it's only a matter of being a better citizen and doing this
on the weekend or outside US peak hours.
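Besides picking a quieter time, one possible way to cope with these
sporadic errors is to retry a failed request after a short pause. What
follows is only a minimal sketch: fetch_with_retries and its parameters
(attempts, delay, sleep) are hypothetical names, not part of Biopython,
and it assumes the failure surfaces as an IOError, as in the traceback
above.

```python
import time


def fetch_with_retries(fetch, attempts=3, delay=5.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is any zero-argument callable that returns the downloaded
    text and raises IOError on a transient server error.
    (Hypothetical helper, not part of Biopython.)
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except IOError:
            if attempt == attempts:
                raise  # out of retries: let the caller see the error
            sleep(delay * attempt)  # wait a little longer each time


# Hypothetical usage inside a download loop, wrapping the request and
# the read in one callable so both are retried together:
# data = fetch_with_retries(lambda: Entrez.efetch(
#     db=dbname, rettype="fasta", retstart=start, retmax=batch_size,
#     webenv=webenv, query_key=query_key).read())
```

The pause grows with each attempt so that a struggling server is not hit
again immediately; the last failure is re-raised unchanged.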
Here is the full script:
from Bio import Entrez

# Always tell NCBI who you are
Entrez.email = "A.N.Other@example.com"
dbname = "code_name_of_the_db"
query_term = "query_term"

# Ask EGQuery how many records match in the database of interest
handle = Entrez.egquery(term=query_term)
record = Entrez.read(handle)
handle.close()
for row in record["eGQueryResult"]:
    if row["DbName"] == dbname:
        egquery_count = int(row["Count"])

esearch_batch_size = 1000
batch_size = 100
out_handle = open("outfile.fasta", "w")
for esearch_start in range(0, egquery_count, esearch_batch_size):
    esearch_end = min(egquery_count, esearch_start + esearch_batch_size)
    print("Going to get IDs of records %i to %i"
          % (esearch_start + 1, esearch_end))
    search_handle = Entrez.esearch(db=dbname, term=query_term, usehistory="y",
                                   retstart=esearch_start,
                                   retmax=esearch_batch_size)
    search_results = Entrez.read(search_handle)
    search_handle.close()
    gi_list = search_results["IdList"]
    # The last batch is usually shorter than esearch_batch_size
    assert len(gi_list) == esearch_end - esearch_start
    count = int(search_results["Count"])
    assert egquery_count == count
    webenv = search_results["WebEnv"]
    query_key = search_results["QueryKey"]
    # Download this window of the history results in smaller chunks
    for start in range(esearch_start, esearch_end, batch_size):
        end = min(esearch_end, start + batch_size)
        print("Going to download records %i to %i" % (start + 1, end))
        fetch_handle = Entrez.efetch(db=dbname, rettype="fasta",
                                     retstart=start, retmax=batch_size,
                                     webenv=webenv, query_key=query_key)
        data = fetch_handle.read()
        fetch_handle.close()
        out_handle.write(data)
out_handle.close()
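The two levels of batching in the script (IDs in windows of 1000,
downloads in windows of 100) come down to the same arithmetic, which can
be checked without touching the NCBI servers. This is just an
illustrative sketch; batch_windows is a hypothetical helper name, and
the figure of 2500 hits is made up for the example.

```python
def batch_windows(total, batch_size):
    """Yield (start, end) pairs covering range(total) in batch_size steps."""
    for start in range(0, total, batch_size):
        # Clamp the end so the final window stops at the last record
        yield start, min(total, start + batch_size)


# With 2500 hits and batches of 1000, the last window is short:
windows = list(batch_windows(2500, 1000))
print(windows)  # [(0, 1000), (1000, 2000), (2000, 2500)]
```

Note that the last window holds only 500 records, so any check that
expects every batch to be exactly batch_size long will fail on the
final pass unless the total is an exact multiple of the batch size.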
regards,
--
Carlos Javier Borroto
Baltimore, MD
Phone: (410) 929 4020