[Biopython] History, Efetch, and returned records limits
Mariam Reyad Rizkallah
mrrizkalla at gmail.com
Sat Apr 14 09:36:29 EDT 2012
Hi Peter,
I am working with EST sequences; I will check whether they are in the
gi_taxid_nucl mapping files.

Please find my script below, with three approaches and their results.
#!/usr/bin/python
import sys
from Bio import Entrez

Entrez.email = "mariam.rizkallah at gmail.com"
txid = int(sys.argv[1])

# Get the total record count for this taxid
prim_handle = Entrez.esearch(db="nucest", term="txid%i[Organism:exp]" % (txid),
                             retmax=20)
prim_record = Entrez.read(prim_handle)
prim_count = prim_record['Count']

# Get all GIs using the history server (Biopython tutorial,
# http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc119)
search_handle = Entrez.esearch(db="nucest", term="txid%i[Organism:exp]" % (txid),
                               retmax=prim_count, usehistory="y")
search_results = Entrez.read(search_handle)
search_handle.close()
gi_list = search_results["IdList"]
count = int(search_results["Count"])
assert count == len(gi_list)
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
out_fasta = "%s_txid%i_ct%i.fasta" % (sys.argv[2], txid, count)
out_handle = open(out_fasta, "a")

## Approach 1: batched efetch over the history; gets <XML> tags within the
## FASTA file: <ERROR>Unable to obtain query #1</ERROR>
batch_size = 1000
for start in range(0, count, batch_size):
    end = min(count, start + batch_size)
    print "Going to download record %i to %i" % (start + 1, end)
    fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text",
                                 retstart=start, retmax=batch_size,
                                 webenv=webenv, query_key=query_key)
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)

## Approach 2: split the GI list into chunks and epost each chunk
def split_list(lst, chunk_size):
    return [lst[offs:offs + chunk_size]
            for offs in range(0, len(lst), chunk_size)]

z = split_list(gi_list, 10000)
for i in range(0, len(z)):
    print len(z[i])
    post_handle = Entrez.epost("nucest", id=",".join(z[i]))
    post_results = Entrez.read(post_handle)
    post_handle.close()
    # Use the WebEnv/QueryKey returned by epost, not those from the
    # earlier search
    webenv = post_results["WebEnv"]
    query_key = post_results["QueryKey"]
    fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text",
                                 webenv=webenv, query_key=query_key)
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)

## Approach 3: single efetch; most consistent retrieval, but limited to 10000
fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text",
                             webenv=webenv, query_key=query_key)
data = fetch_handle.read()
fetch_handle.close()
out_handle.write(data)
out_handle.close()
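As a sanity check, the batching arithmetic used in Approach 1 can be worked out independently of Entrez. This is a minimal sketch; `make_batches` is a hypothetical helper, not part of Biopython, and it simply enumerates the (retstart, retmax) pairs a batched efetch loop would use.

```python
def make_batches(count, batch_size):
    """Return (retstart, retmax) pairs covering `count` records."""
    batches = []
    for start in range(0, count, batch_size):
        end = min(count, start + batch_size)  # last batch may be short
        batches.append((start, end - start))
    return batches

# For 10,300 records in batches of 1,000 this yields eleven pairs,
# ending with a short final batch starting at record 10,000.
pairs = make_batches(10300, 1000)
```

Each pair maps directly onto the `retstart=start, retmax=batch_size` arguments of `Entrez.efetch`, which makes it easy to confirm that no records are skipped or double-counted before involving the network.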
On Sat, Apr 14, 2012 at 2:54 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:
> On Sat, Apr 14, 2012 at 1:35 PM, Mariam Reyad Rizkallah
> <mrrizkalla at gmail.com> wrote:
> > Dear community,
> >
> > I aim to get sequences by a list of GIs (using efetch and history
> > variables) for a certain taxid (using esearch). I always get the first
> > 10,000 records. For example, I need 10,300 GI IDs; I split them into
> > lists of 10,000 and submit them consecutively, and I still get the
> > first 10,000 records. I tried the batch approach in the Biopython
> > tutorial, but didn't even reach 10,000 sequences.
> >
> > Is there a limit for NCBI's returned sequences?
> >
> > Thank you.
> >
> > Mariam
>
> It does sound like you've found some sort of Entrez limit;
> it might be worth emailing the NCBI to clarify this.
>
> Have you considered downloading the GI/taxid mapping
> table from their FTP site instead? e.g.
> http://lists.open-bio.org/pipermail/biopython/2009-June/005295.html
>
> Peter
>
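Peter's suggestion of using the GI/taxid mapping table can be sketched as follows. This assumes the dump file's tab-separated `GI<TAB>taxid` layout; `gis_for_taxid` and the file path are illustrative names, not part of Biopython or the NCBI tooling.

```python
def gis_for_taxid(lines, taxid):
    """Yield GIs whose taxid matches, from lines of 'GI<TAB>taxid'."""
    for line in lines:
        gi, tid = line.rstrip("\n").split("\t")
        if int(tid) == taxid:
            yield gi

# Usage against the real dump (path illustrative):
# with open("gi_taxid_nucl.dmp") as handle:
#     wanted = list(gis_for_taxid(handle, 155978))
```

Streaming the file line by line keeps memory use flat even though the full mapping table is very large, and the resulting GI list can then be fed to efetch in batches.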