[BioPython] Querying Entrez Gene

Palle Villesen palle at birc.au.dk
Thu Oct 12 06:59:26 UTC 2006


Luca Beltrame wrote:
> Hello.
> I'm currently in need of querying the Entrez Gene database using a list of IDs 
> I have. After searching in the Biopython documentation, I have found no 
> indication of whether that is possible or not. 
> Is there a way to query NCBI's Entrez Gene database? 
> Thanks in advance.
>
>   
EUtils is also part of Biopython; the Biopython tutorial explains how to 
use it. Below is my own small "mass downloader" utility in Python 
(running on a non-administrator install of both Python and Biopython).

The basic module you need is HistoryClient, which can search and 
retrieve large sets, so you do not have to loop through all your IDs one 
at a time. Anyway, check the tutorial; it's quite good (at least for a 
person with the same very basic Python knowledge as me).
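
As a rough illustration of the batching idea (not part of gb_search.py, 
and not the EUtils API itself), here is a minimal sketch that groups a 
list of IDs into comma-separated batches, which is the form in which 
NCBI's E-utilities accept several IDs in one request. The helper name 
id_batches and the example IDs are made up for illustration; how each 
batch is handed to the client is deliberately left out.

```python
def id_batches(ids, size=20):
    """Yield comma-separated strings holding at most `size` IDs each.

    `id_batches` is a hypothetical helper for illustration only.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    # Normalise every ID to a stripped string so ints and strings can be mixed.
    cleaned = [str(x).strip() for x in ids]
    for start in range(0, len(cleaned), size):
        yield ",".join(cleaned[start:start + size])


if __name__ == "__main__":
    example_ids = [672, 675, 7157, 1956, 3845]  # arbitrary example input
    for batch in id_batches(example_ids, size=2):
        print(batch)
```

Each yielded string stands for one request, so a list of a few hundred 
IDs turns into a handful of requests rather than one per ID.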

sincerely,
Palle Villesen, BiRC, DK

Program: gb_search.py
-------------------

#!/web/biopv/usr/local/bin/python

import sys
import time

biopython_path='/web/biopv/usr/local/lib/python'
sys.path.insert(0,biopython_path)

def help():
    # Print the usage text (including the known database names) and exit.
    from Bio.EUtils import Config
    dbs=" ".join(Config.databases.keys())
    usage_text= """
GenBank retrieve tool.

Usage:

gb_search.py QUERY [RECS] [DB] [FORMAT] [SLEEP]

QUERY   : the Entrez query enclosed in " "
RECS    : Number of records/sequences to get at a time (default=20)
DB      : Database (default='nucleotide')
          (%s)
FORMAT  : Record format (default='fasta'; 'docsum', 'brief', 'gi'
          and many others are available)
SLEEP   : Seconds to sleep between batches (default=3)
    """ % dbs

    # sys.exit() with a string prints it to stderr and exits with status 1.
    sys.exit(usage_text)

# Default values
step=20
database="nucleotide"
format="fasta"
time2sleep=3

if len(sys.argv) ==1:
    help()

search_term=sys.argv[1]
if len(sys.argv)>2 : step=int(sys.argv[2])
if len(sys.argv)>3 : database=sys.argv[3]
if len(sys.argv)>4 : format=sys.argv[4]
if len(sys.argv)>5 : time2sleep=int(sys.argv[5])

from Bio.EUtils import HistoryClient
s = HistoryClient.HistoryClient().search(search_term,db=database)
print >>sys.stderr, "Getting %s seqs, %s sequences at a time" % (len(s),step)
i=0
while i<len(s):
    print >>sys.stderr, "Getting sequences from ",i,"to",min(i+step,len(s)),
    print s[i:i+step].efetch(retmode = "text", rettype = format).read()
    if i+step >= len(s):
        print >>sys.stderr, "..done"
        break
    print >>sys.stderr, "...done (sleeping %s seconds)" % time2sleep
    i+=step
    time.sleep(time2sleep)
 
-------------------------------------
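
The download loop in the script walks the result set in slices of 
`step` records and stops after the slice that reaches the end. Here is a 
minimal sketch of just that windowing logic, with no network access and 
no Biopython import. The name batch_windows is a hypothetical helper, 
not something from the script or from EUtils.

```python
def batch_windows(total, step):
    """Yield (start, end) index pairs that cover range(total) in chunks of `step`.

    `batch_windows` is a hypothetical helper for illustration only.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    for start in range(0, total, step):
        # The final window is clipped so it never runs past `total`.
        yield start, min(start + step, total)


if __name__ == "__main__":
    total = 45  # pretend the search returned 45 records
    for start, end in batch_windows(total, 20):
        # A pause between requests is only needed when more batches follow.
        note = "" if end >= total else " (sleep before the next batch)"
        print("records %d to %d%s" % (start, end, note))
```

Checking `end >= total` rather than `end > total` avoids a pointless 
pause when the number of records is an exact multiple of the batch size.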

-- 

-._    _.--'"`'--._    _.--'"`'--._    _.--'"`'--._    _ 
    '-:`.'|`|"':-.  '-:`.'|`|"':-.  '-:`.'|`|"':-.  '.` : '.   
  '.  '.  | |  | |'.  '.  | |  | |'.  '.  | |  | |'.  '.:   '.  '.
  : '.  '.| |  | |  '.  '.| |  | |  '.  '.| |  | |  '.  '.  : '.  `.
  '   '.  `.:_ | :_.' '.  `.:_ | :_.' '.  `.:_ | :_.' '.  `.'   `.
         `-..,..-'       `-..,..-'       `-..,..-'       `         `

   Palle Villesen, Ph.D. 
   BiRC, Build. 090, University of Aarhus
   DK - 8000 Aarhus C, Denmark

   palle.retrosearch.dk - +45 61708600
---------------------------------------------------------------------