[Biopython] Searching for and downloading sequences using the history

Carlos Javier Borroto carlos.borroto at gmail.com
Fri Sep 18 18:08:14 UTC 2009


On Fri, Sep 18, 2009 at 1:15 PM, Carlos Javier Borroto
<carlos.borroto at gmail.com> wrote:
> On Fri, Sep 18, 2009 at 12:59 PM, Carlos Javier Borroto
> <carlos.borroto at gmail.com> wrote:
>> Hi all,
>>
>> I'm trying to download all of the EST from a specie, I'm following the
>> example on the tutorial which seems to be exactly what I need. But I
>> running into this problem:
>>
>
> I'm right? I'm going to implement this I share it here.
>

Well, here is my implementation. I'm very new to Biopython, and even to
Python, and my programming skills aren't great either. But because what
I did was mostly copy/paste from the tutorial, I feel confident sharing
this code. Any advice to make it better is highly welcome.

It seems to be working just fine, but I haven't been able to run it to
the end, because I keep getting these sporadic NCBI server errors:
Going to download records 5601 to 5700
Traceback (most recent call last):
  File "ncbi-downloader.py", line 58, in <module>
    webenv=webenv, query_key=query_key)
  File "/usr/local/lib/python2.6/dist-packages/Bio/Entrez/__init__.py", line 126, in efetch
    return _open(cgi, variables)
  File "/usr/local/lib/python2.6/dist-packages/Bio/Entrez/__init__.py", line 373, in _open
    raise IOError(data.strip())
IOError: Error: 156514070 is not available at this time.

Error: 156514069 is not available at this time.

Error: 156514068 is not available at this time.

Error: 156514067 is not available at this time.

But I guess it is only a matter of being a better citizen and doing this
on the weekend or outside USA peak time.
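
Another option for riding out errors like the ones above is to retry a
failed request after a pause. What follows is only a minimal sketch:
call_with_retry is a hypothetical helper, not part of Biopython, and the
default number of attempts and the delay are arbitrary assumptions. It
retries on IOError because that is the exception shown in the traceback.

```python
import time


def call_with_retry(func, attempts=3, delay=5.0):
    """Call func() and return its result, retrying after an IOError.

    Hypothetical helper; attempts and delay are arbitrary defaults.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except IOError:
            # Give up once the final attempt has failed.
            if attempt == attempts:
                raise
            # Wait a little longer after each failure before trying again.
            time.sleep(delay * attempt)
```

In the script below, the Entrez.efetch call could then be passed in as a
small function or lambda, so that a batch that fails once is requested
again instead of aborting the whole download.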

Here is the code:

from Bio import Entrez
Entrez.email = "A.N.Other@example.com"  # always tell NCBI who you are

dbname = "code_name_of_the_db"
query_term = "query_term"

# Ask EGQuery how many records match the query in the chosen database.
handle = Entrez.egquery(term=query_term)
record = Entrez.read(handle)
handle.close()

egquery_count = 0  # stays 0 if the database is not listed in the result
for row in record["eGQueryResult"]:
    if row["DbName"] == dbname:
        egquery_count = int(row["Count"])

esearch_batch_size = 1000
batch_size = 100
out_handle = open("outfile.fasta", "w")
for esearch_start in range(0, egquery_count, esearch_batch_size):
    esearch_end = min(egquery_count, esearch_start + esearch_batch_size)
    print("Going to get IDs of records %i to %i"
          % (esearch_start + 1, esearch_end))
    search_handle = Entrez.esearch(db=dbname, term=query_term,
                                   usehistory="y",
                                   retstart=esearch_start,
                                   retmax=esearch_batch_size)
    search_results = Entrez.read(search_handle)
    search_handle.close()

    gi_list = search_results["IdList"]
    # The last batch is usually shorter than esearch_batch_size.
    assert len(gi_list) == esearch_end - esearch_start
    count = int(search_results["Count"])
    assert egquery_count == count
    webenv = search_results["WebEnv"]
    query_key = search_results["QueryKey"]

    # Stop at esearch_end so the last batch does not run past the results.
    for start in range(esearch_start, esearch_end, batch_size):
        end = min(esearch_end, start + batch_size)
        print("Going to download records %i to %i" % (start + 1, end))
        fetch_handle = Entrez.efetch(db=dbname, rettype="fasta",
                                     retstart=start, retmax=end - start,
                                     webenv=webenv, query_key=query_key)
        data = fetch_handle.read()
        fetch_handle.close()
        out_handle.write(data)
out_handle.close()
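
Both loops in the script use the same windowing arithmetic: step through
the records in fixed-size chunks and clip the last chunk to the total.
As a sketch, that arithmetic can be isolated and checked without
contacting NCBI. The name batch_windows is hypothetical and not a
Biopython function.

```python
def batch_windows(total, batch_size):
    """Yield (start, end) pairs covering range(total) in chunks."""
    for start in range(0, total, batch_size):
        # The final chunk is clipped so it never runs past total.
        yield start, min(total, start + batch_size)


# 250 records in chunks of 100: two full chunks and one short one.
windows = list(batch_windows(250, 100))
```

Here end - start gives the number of records in each chunk, which is the
value that matters for the last, shorter batch.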

regards,
-- 
Carlos Javier Borroto
Baltimore, MD
Phone: (410) 929 4020


