[Biopython-dev] EUtils

Tue Apr 6 18:31:14 EDT 2004

Hey Andrew;
Great to hear from you, as always! Hope you're doing well.

> I was at the PyCon conference a couple weeks ago and talked with
> Mark Johnson from NCBI about several things, including the EUtils
> client package in Biopython.
> 
> That reminded me that I needed to clean up the package and update
> it to support the latest NCBI interface (they added a couple new
> features in the last two years).  I've got some free time this
> week so I'm working on it.

Sweet -- that sounds great. Always glad to have work on it. I know a
few small fixes have gone into the CVS so you may want to work
directly from it, but other than that it's all yours.

>   - don't require the response the DTD.  I now believe that DTDs
>      for this task are the wrong approach.  I'm using elementtree.

Cool. This will also clean up a lot of the install messiness in
setup.py necessary to get the DTDs installed.

>   - figure out a better way to do live tests against the server

If you get somewhere with this please do let me know -- I added some
tests for Registry code recently but have been fairly frustrated
with getting errors because the server is dying. One thing I have
been thinking about is using timeoutsocket:

http://www.timo-tasi.org/python/timeoutsocket.py

to just die neatly on tests if the server is timing out.

> For the last, what I want to do is move the documentation
> and tests out of Bio/EUtils/ and into the proper places for
> Biopython.  As stands EUtils is distributable independent of
> Biopython.  I don't think that's worthwhile so I'm thinking
> of no longer doing that.  

Sounds great -- I've been adding independent documentation into
Docs/cookbook. You'd just need to make a directory for EUtils.

> Is anyone here using the EUtils client for anything and
> has code that I can use?  

I've recently added it within the registry system in
config/DBRegistry.py (class EUtilsDB), so that we can do retrieval
from NCBI in the current right way.

Other then that I've used it in a couple of talks I've given:

http://www.biopython.org/docs/presentations/biopython_exelixis.pdf

and:

http://www.biopython.org/docs/presentations/bosc_biopython.pdf

and I attached a file which I use semi-regularly for retrieving
databases by Taxonomy ID from NCBI (which actually used the
timeoutsocket module I mention above, but from my own personal code
base I use at work).

Hope those help some. Glad to see you around.
Brad
-------------- next part --------------
#!/usr/bin/env python
"""Fetch a number of FASTA files from NCBI based on a taxonomy queries.

This script is meant to be used to scheduled batch downloads of a number
of databases based on taxonomy ids. It reads an input file specified on the
commandline that looks like:

file_name,tax_id,not_tax_id,not_tax_id

Where:
    file_name -- the name of the output file (ie. arabidopsis.fasta)
    tax_id -- the primary taxonmy id we are fetching (ie. 3328)
    not_tax_id -- a list of taxonomy ids to not include. This list can
    be empty or as long as you want.

Usage:
    python get_taxonomy_list.py <tax info file>

Where:
    <tax info file> -- the location of the file to read the taxonomy information
    from, formatted as above

The files are output to a directory called "taxa-<date>" where <date> is the
date the program was run on (ie. 20020904).
"""
import sys
import os
import sgmllib
import time
import urlparse
import time

# biopython
from Bio import Fasta
from Bio.EUtils import HistoryClient

from Bio.PGML.Utils import timeoutsocket
timeoutsocket.setDefaultSocketTimeout(30)

VERBOSE = 1

def main(tax_file):
    assert os.path.exists(tax_file), "Cannot find taxonomy file"
    # create the output directory
    start_dir = os.path.dirname(tax_file)
    time_info = time.localtime(time.time())
    time_name = "%i%02i%02i" % (time_info[0], time_info[1], time_info[2])
    out_name = "taxa-%s" % time_name
    output_dir = os.path.join(start_dir, out_name)
    if not(os.path.exists(output_dir)):
        os.makedirs(output_dir)

    all_tax_info = process_tax_file(tax_file)
    for file_name, tax_id, not_ids in all_tax_info:
        output_file = os.path.join(output_dir, file_name)
        process_taxonomy_id(tax_id, not_ids, output_file)

def process_tax_file(tax_file):
    """Split a taxonomy information file up into relevant info from commas.
    """
    tax_info = []
    handle = open(tax_file, "r")
    for line in handle.xreadlines():
        line = line.strip()
        if line and line.find("#") == -1:
            parts = line.split(",")
            if len(parts) < 2:
                raise ValueError("Line is badly formatted: %s" % line)
            filename = parts[0]
            tax_id = parts[1]
            not_ids = parts[2:]
            tax_info.append((filename, tax_id, not_ids))

    handle.close()
    return tax_info

def process_taxonomy_id(taxonomy_id, not_ids, output_file):
    """Deal with the process of retrieving the FASTA file for a query.

    This uses the new EUtils interface at NCBI, which is so much faster
    and less annoying then the old way.
    """
    if VERBOSE:
        print "Saving taxonomy id %s to %s" % (taxonomy_id, output_file)

    # build the query
    query = "txid%s[Organism]" % taxonomy_id
    for not_id in not_ids:
        query += " NOT txid%s[Organism]" % not_id

    client = HistoryClient.HistoryClient()
    while 1:
        try:
            results = client.search(query, db = "nucleotide")
            break
        except (timeoutsocket.Timeout, timeoutsocket.Error, ValueError):
            print "Timed out talking to NCBI, trying again"
            time.sleep(10)

    if VERBOSE:
        print "\tSearch got %s results" % (len(results))

    while 1:
        try:
            fasta = results.efetch(retmode = "text", rettype = "fasta")
            break
        except (timeoutsocket.Timeout, timeoutsocket.Error, ValueError):
            print "Timed out talking to NCBI, trying again"
            time.sleep(10)

    output_handle = open(output_file, "w")
    while 1:
        line = fasta.readline()
        if not(line):
            break
        output_handle.write(line)
    output_handle.close()

    if VERBOSE:
        output_handle = open(output_file)
        num_seqs = check_seqs(output_handle)
        output_handle.close()
        print "\tWrote %s sequences" % (num_seqs)

def check_seqs(handle):
    """Count the number of FASTA sequences in the passed handle.
    """
    num_seqs = 0
    iterator = Fasta.Iterator(handle)
    while 1:
        rec = iterator.next()
        if not rec:
            break
        num_seqs += 1

    return num_seqs

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print "Invalid number of arguments supplied"
        print __doc__
        sys.exit()

    sys.exit(main(sys.argv[1]))