[BioPython] NCBI: from protein to CDS
Jason Stajich
jason at cgt.mc.duke.edu
Thu Apr 10 23:16:40 EDT 2003
Sorry had to do it =) apologies for the perl code on a python list...
Not totally trivial but it's doable in bioperl.
<caveats>
Note don't use this for lots of request - NCBI will shut you off. Better
to download genbank local, yadda yadda if you are going to do multiple
genomes. Follow NCBI's directions on limiting queries (we institute a
2sec delay on our queries).
</caveats>
(download/install bioperl-1.2.1 and IO::String)
#!/usr/bin/perl -w
use strict;
use Bio::DB::GenBank;
use Bio::DB::GenPept;
use Bio::DB::FileCache;
use Bio::Factory::FTLocationFactory;
use Bio::SeqFeature::Generic;
my $ntdb = new Bio::DB::GenBank;
my $pepdb= new Bio::DB::GenPept;
# do some caching in the event you're pulling up the same
# chromosome and/or you are debugging
my $cachent = new Bio::DB::FileCache(-kept => 1,
-file => '/tmp/cache/nt.idx',
-seqdb => $ntdb);
my $cachepep = new Bio::DB::FileCache(-kept => 1,
-file => '/tmp/cache/pep.idx',
-seqdb => $pepdb);
# obj to turn strings into Bio::Location object
my $locfactory = new Bio::Factory::FTLocationFactory;
# you might get these from a file (and they can be accessions too)
my @protgis = (10956263);
foreach my $gi ( @protgis ) {
my $protseq = $cachepep->get_Seq_by_id($gi);
if( ! $protseq ) { print STDERR "could not find a seq for gi:$gi\n";
next;
}
foreach my $cds ( grep { $_->primary_tag eq 'CDS' }
$protseq->get_SeqFeatures() )
{
next unless( $cds->has_tag('coded_by') ); # skip CDSes with no coded_by
my ($codedby) = $cds->each_tag_value('coded_by');
my ($ntacc,$loc) = split(/\:/, $codedby);
$ntacc =~ s/(\.\d+)//; # genbank wants an accession not a versioned one
my $cdslocation = $locfactory->from_string($loc);
my $cdsfeature = new Bio::SeqFeature::Generic(-location => $cdslocation);
my $ntseq = $cachent->get_Seq_by_acc($ntacc);
next unless $ntseq;
$ntseq->add_SeqFeature($cdsfeature); # locate the feature on a seq
my $cdsseq = $cdsfeature->spliced_seq();
print "cds seq is ", $cdsseq->seq(), "\n";
}
}
On Thu, 10 Apr 2003, Iddo Friedberg wrote:
> Sorry, slight mistake in my question.
>
> The protein sequence is retrieved using the gi-10956263
>
> Then the CDS link hold the following:
>
> val=10956247 (which is the nucleotide GI)
> itemID=98 (which tells us which bit of the nucleotide sequence actually
> codes for this protein).
>
> Best,
>
> Iddo
>
> Iddo Friedberg wrote:
> > Hi,
> >
> > Can anyone suggest a painless way to retrieve a coding sequence using a
> > protein gi number via NCBI? manually that would mean to:
> >
> > 1) go to the protein page using the protein gi.
> > 2) Click on the CDS
> > 3) Get the DNA sequence coding to it.
> >
> > Here why this is a bummer:
> >
> > Using the gi-10956247 a protein record is retrieved from NCBI. The URL
> > for the CDS looks like :
> >
> > "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=10956247&itemID=98&view=gbwithparts"
> >
> >
> > the "val" field holds the same gi number, but the itemID field varies
> > (that depends on the coding location in the full genome record, in this
> > case). So I cannot use the protein gi to go and retrieve its coding
> > sequence.
> >
> > I *can* write something to read the URL itemID field value, I just
> > thought there might be a more elegant way. Maybe even using "legitimate"
> > NCBI mechanisms...
> >
> > Thanks,
> >
> > Iddo
> >
> >
> >
>
>
--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu
More information about the BioPython
mailing list