[BioPython] NCBI: from protein to CDS

Fri Apr 11 01:36:55 EDT 2003

Iddo:
>> Can anyone suggest a painless way to retrieve a coding sequence using 
>> a protein gi number via NCBI? manually that would mean to:
>> 1) go to the protein page using the protein gi.
>> 2) Click on the CDS
>> 3) Get the DNA sequence coding to it.

Hmm..  Here's one way using my EUtils package, available from
   http://www.dalkescientific.com/EUtils/

  =-=-=-=-=-=-=-=-=

from EUtils import DBIds, DBIdsClient
from Bio import GenBank
from Bio.GenBank import LocationParser

# Get a handle to the protein record
h = DBIdsClient.from_dbids(DBIds(db = "protein", ids = ["10956263"]))

# Fetch returns a file handle in GenPept format
infile = h.efetch(retmode = "text", rettype = "gp")

# Parse the first record
rec = GenBank.Iterator(infile, GenBank.FeatureParser()).next()

# Get the CDS feature
for feature in rec.features:
     if feature.type == "CDS":
         break
else:
     raise "CDS not found"

# Get the location qualifier (note -- need to be fixed since there
# may be multiple qualifiers with the same name)
loc = feature.qualifiers["coded_by"][0]

# This returns a string like
#     NC_001496.1:22188..23897

# little helper function to look up a sequence given the
# sequence accession, start, and stop location.  Locations
# are given in NCBI coordinates, not Python ones.
def lookup(name, seq_start, seq_stop):
     h = DBIdsClient.from_dbids(DBIds(db = "nucleotide", ids = [name]))
     return h.efetch(retmode = "text", rettype = "fasta",
                     seq_start = seq_start, seq_stop = seq_stop).read()

# Parse the location string
parsed_loc = LocationParser.parse(LocationParser.scan(loc))

# Double-check that I know how to handle it.  Could be
# extended to handle more cases.
assert isinstance(parsed_loc, LocationParser.AbsoluteLocation)
assert isinstance(parsed_loc.local_location, LocationParser.Range)

seq_start = parsed_loc.local_location.low
seq_stop = parsed_loc.local_location.high
assert isinstance(seq_start, LocationParser.Integer)
assert isinstance(seq_stop, LocationParser.Integer)
seq_start = seq_start.val
seq_stop = seq_stop.val

# Get the actual sequence.  In this case, in FASTA format.
print lookup(parsed_loc.path.accession, seq_start, seq_stop)

  =-=-=-=-=-=-=-=-=

I started to work on a way to assemble a sequence based on the
location information, but bailed and decided to handle only one
segment.

EUtils doesn't yet have throttling support - see Jason's comments
(and the EUtils distribution) about NCBI's use restrictions.

					Andrew