[BioPython] NCBI: from protein to CDS

Fri Apr 11 11:07:06 EDT 2003

Wow! This is really cool... just what the doctor ordered.

I must confess I completely forgot about the existence of EUtils. I'm 
pretty much taking my baby steps in getting stuff off the Web in 
general, and NCBI in particular. The NCBI eutil documentation is a bit 
lacking, so this bit of code is actually a very good learning-by-example 
Biopython+EUtils combo.

Thanks!

Iddo

Andrew Dalke wrote:
> Iddo:
> 
>>> Can anyone suggest a painless way to retrieve a coding sequence using 
>>> a protein gi number via NCBI? manually that would mean to:
>>> 1) go to the protein page using the protein gi.
>>> 2) Click on the CDS
>>> 3) Get the DNA sequence coding to it.
> 
> 
> Hmm..  Here's one way using my EUtils package, available from
>   http://www.dalkescientific.com/EUtils/
> 
>  =-=-=-=-=-=-=-=-=
> 
> from EUtils import DBIds, DBIdsClient
> from Bio import GenBank
> from Bio.GenBank import LocationParser
> 
> # Get a handle to the protein record
> h = DBIdsClient.from_dbids(DBIds(db = "protein", ids = ["10956263"]))
> 
> # Fetch returns a file handle in GenPept format
> infile = h.efetch(retmode = "text", rettype = "gp")
> 
> # Parse the first record
> rec = GenBank.Iterator(infile, GenBank.FeatureParser()).next()
> 
> # Get the CDS feature
> for feature in rec.features:
>     if feature.type == "CDS":
>         break
> else:
>     raise "CDS not found"
> 
> # Get the location qualifier (note -- need to be fixed since there
> # may be multiple qualifiers with the same name)
> loc = feature.qualifiers["coded_by"][0]
> 
> # This returns a string like
> #     NC_001496.1:22188..23897
> 
> # little helper function to look up a sequence given the
> # sequence accession, start, and stop location.  Locations
> # are given in NCBI coordinates, not Python ones.
> def lookup(name, seq_start, seq_stop):
>     h = DBIdsClient.from_dbids(DBIds(db = "nucleotide", ids = [name]))
>     return h.efetch(retmode = "text", rettype = "fasta",
>                     seq_start = seq_start, seq_stop = seq_stop).read()
> 
> # Parse the location string
> parsed_loc = LocationParser.parse(LocationParser.scan(loc))
> 
> # Double-check that I know how to handle it.  Could be
> # extended to handle more cases.
> assert isinstance(parsed_loc, LocationParser.AbsoluteLocation)
> assert isinstance(parsed_loc.local_location, LocationParser.Range)
> 
> seq_start = parsed_loc.local_location.low
> seq_stop = parsed_loc.local_location.high
> assert isinstance(seq_start, LocationParser.Integer)
> assert isinstance(seq_stop, LocationParser.Integer)
> seq_start = seq_start.val
> seq_stop = seq_stop.val
> 
> # Get the actual sequence.  In this case, in FASTA format.
> print lookup(parsed_loc.path.accession, seq_start, seq_stop)
> 
>  =-=-=-=-=-=-=-=-=
> 
> I started to work on a way to assemble a sequence based on the
> location information, but bailed and decided to handle only one
> segment.
> 
> EUtils doesn't yet have throttling support - see Jason's comments
> (and the EUtils distribution) about NCBI's use restrictions.
> 
>                     Andrew
> 
> _______________________________________________
> BioPython mailing list  -  BioPython at biopython.org
> http://biopython.org/mailman/listinfo/biopython
> 
> 

-- 
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037
USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 646 3171
http://bioinformatics.ljcrf.edu/~iddo