[BioPython] NCBI: from protein to CDS
Andrew Dalke
dalke at dalkescientific.com
Fri Apr 11 01:36:55 EDT 2003
Iddo:
>> Can anyone suggest a painless way to retrieve a coding sequence using
>> a protein gi number via NCBI? manually that would mean to:
>> 1) go to the protein page using the protein gi.
>> 2) Click on the CDS
>> 3) Get the DNA sequence coding to it.
Hmm.. Here's one way using my EUtils package, available from
http://www.dalkescientific.com/EUtils/
=-=-=-=-=-=-=-=-=
from EUtils import DBIds, DBIdsClient
from Bio import GenBank
from Bio.GenBank import LocationParser
# Get a handle to the protein record
h = DBIdsClient.from_dbids(DBIds(db = "protein", ids = ["10956263"]))
# Fetch returns a file handle in GenPept format
infile = h.efetch(retmode = "text", rettype = "gp")
# Parse the first record
rec = GenBank.Iterator(infile, GenBank.FeatureParser()).next()
# Get the CDS feature
for feature in rec.features:
if feature.type == "CDS":
break
else:
raise "CDS not found"
# Get the location qualifier (note -- need to be fixed since there
# may be multiple qualifiers with the same name)
loc = feature.qualifiers["coded_by"][0]
# This returns a string like
# NC_001496.1:22188..23897
# little helper function to look up a sequence given the
# sequence accession, start, and stop location. Locations
# are given in NCBI coordinates, not Python ones.
def lookup(name, seq_start, seq_stop):
h = DBIdsClient.from_dbids(DBIds(db = "nucleotide", ids = [name]))
return h.efetch(retmode = "text", rettype = "fasta",
seq_start = seq_start, seq_stop = seq_stop).read()
# Parse the location string
parsed_loc = LocationParser.parse(LocationParser.scan(loc))
# Double-check that I know how to handle it. Could be
# extended to handle more cases.
assert isinstance(parsed_loc, LocationParser.AbsoluteLocation)
assert isinstance(parsed_loc.local_location, LocationParser.Range)
seq_start = parsed_loc.local_location.low
seq_stop = parsed_loc.local_location.high
assert isinstance(seq_start, LocationParser.Integer)
assert isinstance(seq_stop, LocationParser.Integer)
seq_start = seq_start.val
seq_stop = seq_stop.val
# Get the actual sequence. In this case, in FASTA format.
print lookup(parsed_loc.path.accession, seq_start, seq_stop)
=-=-=-=-=-=-=-=-=
I started to work on a way to assemble a sequence based on the
location information, but bailed and decided to handle only one
segment.
EUtils doesn't yet have throttling support - see Jason's comments
(and the EUtils distribution) about NCBI's use restrictions.
Andrew
More information about the BioPython
mailing list