[BioPython] NCBI: from protein to CDS
Iddo Friedberg
idoerg at burnham.org
Thu Apr 10 20:18:09 EDT 2003
The truth Jason: Jeff Chang deviously put you up to this to get me to
write the appropriate code for Biopython, right?
Thanks! I'll use it!
Best,
Iddo
Jason Stajich wrote:
> Sorry had to do it =) apologies for the perl code on a python list...
>
> Not totally trivial but it's doable in bioperl.
> <caveats>
> Note don't use this for lots of request - NCBI will shut you off. Better
> to download genbank local, yadda yadda if you are going to do multiple
> genomes. Follow NCBI's directions on limiting queries (we institute a
> 2sec delay on our queries).
> </caveats>
> (download/install bioperl-1.2.1 and IO::String)
> #!/usr/bin/perl -w
> use strict;
> use Bio::DB::GenBank;
> use Bio::DB::GenPept;
> use Bio::DB::FileCache;
> use Bio::Factory::FTLocationFactory;
> use Bio::SeqFeature::Generic;
>
> my $ntdb = new Bio::DB::GenBank;
> my $pepdb= new Bio::DB::GenPept;
>
> # do some caching in the event you're pulling up the same
> # chromosome and/or you are debugging
> my $cachent = new Bio::DB::FileCache(-kept => 1,
> -file => '/tmp/cache/nt.idx',
> -seqdb => $ntdb);
>
> my $cachepep = new Bio::DB::FileCache(-kept => 1,
> -file => '/tmp/cache/pep.idx',
> -seqdb => $pepdb);
>
> # obj to turn strings into Bio::Location object
> my $locfactory = new Bio::Factory::FTLocationFactory;
>
> # you might get these from a file (and they can be accessions too)
> my @protgis = (10956263);
>
> foreach my $gi ( @protgis ) {
> my $protseq = $cachepep->get_Seq_by_id($gi);
> if( ! $protseq ) { print STDERR "could not find a seq for gi:$gi\n";
> next;
> }
> foreach my $cds ( grep { $_->primary_tag eq 'CDS' }
> $protseq->get_SeqFeatures() )
> {
> next unless( $cds->has_tag('coded_by') ); # skip CDSes with no coded_by
> my ($codedby) = $cds->each_tag_value('coded_by');
> my ($ntacc,$loc) = split(/\:/, $codedby);
> $ntacc =~ s/(\.\d+)//; # genbank wants an accession not a versioned one
> my $cdslocation = $locfactory->from_string($loc);
> my $cdsfeature = new Bio::SeqFeature::Generic(-location => $cdslocation);
> my $ntseq = $cachent->get_Seq_by_acc($ntacc);
> next unless $ntseq;
> $ntseq->add_SeqFeature($cdsfeature); # locate the feature on a seq
> my $cdsseq = $cdsfeature->spliced_seq();
> print "cds seq is ", $cdsseq->seq(), "\n";
> }
> }
>
> On Thu, 10 Apr 2003, Iddo Friedberg wrote:
>
>
>>Sorry, slight mistake in my question.
>>
>>The protein sequence is retrieved using the gi-10956263
>>
>>Then the CDS link hold the following:
>>
>>val=10956247 (which is the nucleotide GI)
>>itemID=98 (which tells us which bit of the nucleotide sequence actually
>>codes for this protein).
>>
>>Best,
>>
>>Iddo
>>
>>Iddo Friedberg wrote:
>>
>>>Hi,
>>>
>>>Can anyone suggest a painless way to retrieve a coding sequence using a
>>>protein gi number via NCBI? manually that would mean to:
>>>
>>>1) go to the protein page using the protein gi.
>>>2) Click on the CDS
>>>3) Get the DNA sequence coding to it.
>>>
>>>Here why this is a bummer:
>>>
>>>Using the gi-10956247 a protein record is retrieved from NCBI. The URL
>>>for the CDS looks like :
>>>
>>>"http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=10956247&itemID=98&view=gbwithparts"
>>>
>>>
>>>the "val" field holds the same gi number, but the itemID field varies
>>>(that depends on the coding location in the full genome record, in this
>>>case). So I cannot use the protein gi to go and retrieve its coding
>>>sequence.
>>>
>>>I *can* write something to read the URL itemID field value, I just
>>>thought there might be a more elegant way. Maybe even using "legitimate"
>>>NCBI mechanisms...
>>>
>>>Thanks,
>>>
>>>Iddo
>>>
>>>
>>>
>>
>>
>
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu
> _______________________________________________
> BioPython mailing list - BioPython at biopython.org
> http://biopython.org/mailman/listinfo/biopython
>
>
--
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037
USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 646 3171
http://bioinformatics.ljcrf.edu/~iddo
More information about the BioPython
mailing list