[BioPython] NCBI: from protein to CDS

Thu Apr 10 20:18:09 EDT 2003

The truth Jason: Jeff Chang deviously put you up to this to get me to 
write the appropriate code for Biopython, right?

Thanks! I'll use it!

Best,

Iddo

Jason Stajich wrote:
> Sorry had to do it =)  apologies for the perl code on a python list...
> 
> Not totally trivial but it's doable in bioperl.
> <caveats>
> Note don't use this for lots of request - NCBI will shut you off.  Better
> to download genbank local, yadda yadda if you are going to do multiple
> genomes.  Follow NCBI's directions on limiting queries (we institute a
> 2sec delay on our queries).
> </caveats>
> (download/install bioperl-1.2.1 and IO::String)
> #!/usr/bin/perl -w
> use strict;
> use Bio::DB::GenBank;
> use Bio::DB::GenPept;
> use Bio::DB::FileCache;
> use Bio::Factory::FTLocationFactory;
> use Bio::SeqFeature::Generic;
> 
> my $ntdb = new Bio::DB::GenBank;
> my $pepdb= new Bio::DB::GenPept;
> 
> # do some caching in the event you're pulling up the same
> # chromosome and/or you are debugging
> my $cachent = new Bio::DB::FileCache(-kept => 1,
> 				     -file => '/tmp/cache/nt.idx',
> 				     -seqdb => $ntdb);
> 
> my $cachepep = new Bio::DB::FileCache(-kept => 1,
> 				      -file => '/tmp/cache/pep.idx',
> 				      -seqdb => $pepdb);
> 
> # obj to turn strings into Bio::Location object
> my $locfactory = new Bio::Factory::FTLocationFactory;
> 
> # you might get these from a file (and they can be accessions too)
> my @protgis = (10956263);
> 
> foreach my $gi ( @protgis ) {
>   my $protseq = $cachepep->get_Seq_by_id($gi);
>   if( ! $protseq ) { print STDERR "could not find a seq for gi:$gi\n";
> 		     next;
> 	           }
>   foreach my $cds (  grep { $_->primary_tag eq 'CDS' }
>                           $protseq->get_SeqFeatures() )
>   {
>      next unless( $cds->has_tag('coded_by') ); # skip CDSes with no coded_by
>      my ($codedby) = $cds->each_tag_value('coded_by');
>      my ($ntacc,$loc) = split(/\:/, $codedby);
>      $ntacc =~ s/(\.\d+)//; # genbank wants an accession not a versioned one
>      my $cdslocation = $locfactory->from_string($loc);
>      my $cdsfeature = new Bio::SeqFeature::Generic(-location => $cdslocation);
>      my $ntseq = $cachent->get_Seq_by_acc($ntacc);
>      next unless $ntseq;
>      $ntseq->add_SeqFeature($cdsfeature); # locate the feature on a seq
>      my $cdsseq = $cdsfeature->spliced_seq();
>      print "cds seq is ", $cdsseq->seq(), "\n";
>  }
> }
> 
> On Thu, 10 Apr 2003, Iddo Friedberg wrote:
> 
> 
>>Sorry, slight mistake in my question.
>>
>>The protein sequence is retrieved using the gi-10956263
>>
>>Then the CDS link hold the following:
>>
>>val=10956247  (which is the nucleotide GI)
>>itemID=98 (which tells us which bit of the nucleotide sequence actually
>>codes for this protein).
>>
>>Best,
>>
>>Iddo
>>
>>Iddo Friedberg wrote:
>>
>>>Hi,
>>>
>>>Can anyone suggest a painless way to retrieve a coding sequence using a
>>>protein gi number via NCBI? manually that would mean to:
>>>
>>>1) go to the protein page using the protein gi.
>>>2) Click on the CDS
>>>3) Get the DNA sequence coding to it.
>>>
>>>Here why this is a bummer:
>>>
>>>Using the gi-10956247 a protein record is retrieved from NCBI. The URL
>>>for the CDS looks like :
>>>
>>>"http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=10956247&itemID=98&view=gbwithparts"
>>>
>>>
>>>the "val" field holds the same gi number, but the itemID field varies
>>>(that depends on the coding location in the full genome record, in this
>>>case). So I cannot use the protein gi to go and retrieve its coding
>>>sequence.
>>>
>>>I *can* write something to read the URL itemID field value, I just
>>>thought there might be a more elegant way. Maybe even using "legitimate"
>>>NCBI mechanisms...
>>>
>>>Thanks,
>>>
>>>Iddo
>>>
>>>
>>>
>>
>>
> 
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu
> _______________________________________________
> BioPython mailing list  -  BioPython at biopython.org
> http://biopython.org/mailman/listinfo/biopython
> 
> 

-- 
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037
USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 646 3171
http://bioinformatics.ljcrf.edu/~iddo