[BioPython] NCBI: from protein to CDS

Thu Apr 10 23:16:40 EDT 2003

Sorry had to do it =)  apologies for the perl code on a python list...

Not totally trivial but it's doable in bioperl.
<caveats>
Note don't use this for lots of request - NCBI will shut you off.  Better
to download genbank local, yadda yadda if you are going to do multiple
genomes.  Follow NCBI's directions on limiting queries (we institute a
2sec delay on our queries).
</caveats>
(download/install bioperl-1.2.1 and IO::String)
#!/usr/bin/perl -w
use strict;
use Bio::DB::GenBank;
use Bio::DB::GenPept;
use Bio::DB::FileCache;
use Bio::Factory::FTLocationFactory;
use Bio::SeqFeature::Generic;

my $ntdb = new Bio::DB::GenBank;
my $pepdb= new Bio::DB::GenPept;

# do some caching in the event you're pulling up the same
# chromosome and/or you are debugging
my $cachent = new Bio::DB::FileCache(-kept => 1,
				     -file => '/tmp/cache/nt.idx',
				     -seqdb => $ntdb);

my $cachepep = new Bio::DB::FileCache(-kept => 1,
				      -file => '/tmp/cache/pep.idx',
				      -seqdb => $pepdb);

# obj to turn strings into Bio::Location object
my $locfactory = new Bio::Factory::FTLocationFactory;

# you might get these from a file (and they can be accessions too)
my @protgis = (10956263);

foreach my $gi ( @protgis ) {
  my $protseq = $cachepep->get_Seq_by_id($gi);
  if( ! $protseq ) { print STDERR "could not find a seq for gi:$gi\n";
		     next;
	           }
  foreach my $cds (  grep { $_->primary_tag eq 'CDS' }
                          $protseq->get_SeqFeatures() )
  {
     next unless( $cds->has_tag('coded_by') ); # skip CDSes with no coded_by
     my ($codedby) = $cds->each_tag_value('coded_by');
     my ($ntacc,$loc) = split(/\:/, $codedby);
     $ntacc =~ s/(\.\d+)//; # genbank wants an accession not a versioned one
     my $cdslocation = $locfactory->from_string($loc);
     my $cdsfeature = new Bio::SeqFeature::Generic(-location => $cdslocation);
     my $ntseq = $cachent->get_Seq_by_acc($ntacc);
     next unless $ntseq;
     $ntseq->add_SeqFeature($cdsfeature); # locate the feature on a seq
     my $cdsseq = $cdsfeature->spliced_seq();
     print "cds seq is ", $cdsseq->seq(), "\n";
 }
}

On Thu, 10 Apr 2003, Iddo Friedberg wrote:

> Sorry, slight mistake in my question.
>
> The protein sequence is retrieved using the gi-10956263
>
> Then the CDS link hold the following:
>
> val=10956247  (which is the nucleotide GI)
> itemID=98 (which tells us which bit of the nucleotide sequence actually
> codes for this protein).
>
> Best,
>
> Iddo
>
> Iddo Friedberg wrote:
> > Hi,
> >
> > Can anyone suggest a painless way to retrieve a coding sequence using a
> > protein gi number via NCBI? manually that would mean to:
> >
> > 1) go to the protein page using the protein gi.
> > 2) Click on the CDS
> > 3) Get the DNA sequence coding to it.
> >
> > Here why this is a bummer:
> >
> > Using the gi-10956247 a protein record is retrieved from NCBI. The URL
> > for the CDS looks like :
> >
> > "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=10956247&itemID=98&view=gbwithparts"
> >
> >
> > the "val" field holds the same gi number, but the itemID field varies
> > (that depends on the coding location in the full genome record, in this
> > case). So I cannot use the protein gi to go and retrieve its coding
> > sequence.
> >
> > I *can* write something to read the URL itemID field value, I just
> > thought there might be a more elegant way. Maybe even using "legitimate"
> > NCBI mechanisms...
> >
> > Thanks,
> >
> > Iddo
> >
> >
> >
>
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu