[Biopython-dev] EUtils
Brad Chapman
chapmanb at uga.edu
Tue Apr 6 18:31:14 EDT 2004
Hey Andrew;
Great to hear from you, as always! Hope you're doing well.
> I was at the PyCon conference a couple weeks ago and talked with
> Mark Johnson from NCBI about several things, including the EUtils
> client package in Biopython.
>
> That reminded me that I needed to clean up the package and update
> it to support the latest NCBI interface (they added a couple new
> features in the last two years). I've got some free time this
> week so I'm working on it.
Sweet -- that sounds great. Always glad to have work on it. I know a
few small fixes have gone into the CVS so you may want to work
directly from it, but other than that it's all yours.
> - don't require the response the DTD. I now believe that DTDs
> for this task are the wrong approach. I'm using elementtree.
Cool. This will also clean up a lot of the install messiness in
setup.py necessary to get the DTDs installed.
> - figure out a better way to do live tests against the server
If you get somewhere with this please do let me know -- I added some
tests for Registry code recently but have been fairly frustrated
with getting errors because the server is dying. One thing I have
been thinking about is using timeoutsocket:
http://www.timo-tasi.org/python/timeoutsocket.py
to just die neatly on tests if the server is timing out.
> For the last, what I want to do is move the documentation
> and tests out of Bio/EUtils/ and into the proper places for
> Biopython. As stands EUtils is distributable independent of
> Biopython. I don't think that's worthwhile so I'm thinking
> of no longer doing that.
Sounds great -- I've been adding independent documentation into
Docs/cookbook. You'd just need to make a directory for EUtils.
> Is anyone here using the EUtils client for anything and
> has code that I can use?
I've recently added it within the registry system in
config/DBRegistry.py (class EUtilsDB), so that we can do retrieval
from NCBI in the current right way.
Other then that I've used it in a couple of talks I've given:
http://www.biopython.org/docs/presentations/biopython_exelixis.pdf
and:
http://www.biopython.org/docs/presentations/bosc_biopython.pdf
and I attached a file which I use semi-regularly for retrieving
databases by Taxonomy ID from NCBI (which actually used the
timeoutsocket module I mention above, but from my own personal code
base I use at work).
Hope those help some. Glad to see you around.
Brad
-------------- next part --------------
#!/usr/bin/env python
"""Fetch a number of FASTA files from NCBI based on a taxonomy queries.
This script is meant to be used to scheduled batch downloads of a number
of databases based on taxonomy ids. It reads an input file specified on the
commandline that looks like:
file_name,tax_id,not_tax_id,not_tax_id
Where:
file_name -- the name of the output file (ie. arabidopsis.fasta)
tax_id -- the primary taxonmy id we are fetching (ie. 3328)
not_tax_id -- a list of taxonomy ids to not include. This list can
be empty or as long as you want.
Usage:
python get_taxonomy_list.py <tax info file>
Where:
<tax info file> -- the location of the file to read the taxonomy information
from, formatted as above
The files are output to a directory called "taxa-<date>" where <date> is the
date the program was run on (ie. 20020904).
"""
import sys
import os
import sgmllib
import time
import urlparse
import time
# biopython
from Bio import Fasta
from Bio.EUtils import HistoryClient
from Bio.PGML.Utils import timeoutsocket
timeoutsocket.setDefaultSocketTimeout(30)
VERBOSE = 1
def main(tax_file):
assert os.path.exists(tax_file), "Cannot find taxonomy file"
# create the output directory
start_dir = os.path.dirname(tax_file)
time_info = time.localtime(time.time())
time_name = "%i%02i%02i" % (time_info[0], time_info[1], time_info[2])
out_name = "taxa-%s" % time_name
output_dir = os.path.join(start_dir, out_name)
if not(os.path.exists(output_dir)):
os.makedirs(output_dir)
all_tax_info = process_tax_file(tax_file)
for file_name, tax_id, not_ids in all_tax_info:
output_file = os.path.join(output_dir, file_name)
process_taxonomy_id(tax_id, not_ids, output_file)
def process_tax_file(tax_file):
"""Split a taxonomy information file up into relevant info from commas.
"""
tax_info = []
handle = open(tax_file, "r")
for line in handle.xreadlines():
line = line.strip()
if line and line.find("#") == -1:
parts = line.split(",")
if len(parts) < 2:
raise ValueError("Line is badly formatted: %s" % line)
filename = parts[0]
tax_id = parts[1]
not_ids = parts[2:]
tax_info.append((filename, tax_id, not_ids))
handle.close()
return tax_info
def process_taxonomy_id(taxonomy_id, not_ids, output_file):
"""Deal with the process of retrieving the FASTA file for a query.
This uses the new EUtils interface at NCBI, which is so much faster
and less annoying then the old way.
"""
if VERBOSE:
print "Saving taxonomy id %s to %s" % (taxonomy_id, output_file)
# build the query
query = "txid%s[Organism]" % taxonomy_id
for not_id in not_ids:
query += " NOT txid%s[Organism]" % not_id
client = HistoryClient.HistoryClient()
while 1:
try:
results = client.search(query, db = "nucleotide")
break
except (timeoutsocket.Timeout, timeoutsocket.Error, ValueError):
print "Timed out talking to NCBI, trying again"
time.sleep(10)
if VERBOSE:
print "\tSearch got %s results" % (len(results))
while 1:
try:
fasta = results.efetch(retmode = "text", rettype = "fasta")
break
except (timeoutsocket.Timeout, timeoutsocket.Error, ValueError):
print "Timed out talking to NCBI, trying again"
time.sleep(10)
output_handle = open(output_file, "w")
while 1:
line = fasta.readline()
if not(line):
break
output_handle.write(line)
output_handle.close()
if VERBOSE:
output_handle = open(output_file)
num_seqs = check_seqs(output_handle)
output_handle.close()
print "\tWrote %s sequences" % (num_seqs)
def check_seqs(handle):
"""Count the number of FASTA sequences in the passed handle.
"""
num_seqs = 0
iterator = Fasta.Iterator(handle)
while 1:
rec = iterator.next()
if not rec:
break
num_seqs += 1
return num_seqs
if __name__ == "__main__":
if len(sys.argv) != 2:
print "Invalid number of arguments supplied"
print __doc__
sys.exit()
sys.exit(main(sys.argv[1]))
More information about the Biopython-dev
mailing list