[Bioperl-l] Getting coding sequence starting with a protein record

Warren Gallin wgallin at ualberta.ca
Tue Apr 15 21:23:50 UTC 2014


Jason,

	Works almost perfectly, except I am getting back the protein sequence rather than the underlying nucleotide sequence.

	My specific code fragment is:



	my $gb_db = Bio::DB::GenBank->new();
 
	<...Bunch of code that retrieves a protein GenBank formatted file and walks through the features until...>

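        my $feature = $feature_object->primary_tag;

        # Only CDS features are of interest; skip everything else.
        next if $feature ne "CDS";

        # Splice the CDS, using $gb_db to fetch any remote sub-locations.
        $spliced_cds = $feature_object->spliced_seq($gb_db);
        $na_seq      = $spliced_cds->seq;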

	< More code, that leads to printing the value for $na_seq …>

	So somehow the nucleotide sequence is being translated into protein sequence - is there some option that needs setting to prevent translation?
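
	A possible explanation, offered as an assumption rather than something confirmed in this thread: on a protein (GenPept) record the CDS feature's own location is usually in protein coordinates, so splicing it returns residues, while the nucleotide location is normally kept as a string in the feature's coded_by tag. The sketch below uses plain Perl only (no BioPerl calls) to split such a string into per-accession segments; parse_coded_by and the XX00000N.1 accessions are hypothetical names made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a GenBank-style location string such as
#   join(XX000001.1:10..50,XX000002.1:3..40)
# into a list of { accession, start, end } hashes.
# Only simple "accession:start..end" pieces are recognised; partial
# markers (< and >) are tolerated and ignored, and complement() is
# not handled at all.
sub parse_coded_by {
    my ($location) = @_;
    my @segments;
    while ( $location =~ /([A-Za-z_0-9]+(?:\.\d+)?):<?(\d+)\.\.>?(\d+)/g ) {
        push @segments, { accession => $1, start => $2, end => $3 };
    }
    return @segments;
}

# Made-up example string with three remote segments.
my @segments = parse_coded_by(
    'join(XX000001.1:10..50,XX000002.1:3..40,XX000003.1:1..25)'
);

# Print one line per segment in splice order.
printf "%s %d..%d\n", @{$_}{qw(accession start end)} for @segments;
```

	Each segment names one nucleotide record and the range to take from it, in splice order. Fetching those records and joining the subsequences is left out here because it depends on the database access already set up in the surrounding code.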

Warren


On Apr 15, 2014, at 1:11 PM, Jason Stajich <jason at bioperl.org> wrote:

> This is supported in BioPerl through the feature objects and the Bio::SeqFeatureI method spliced_seq.
> You just need a Bio::DB::GenBank object, which you pass to the method:
> 
> my $db = Bio::DB::GenBank->new();
> my $spliced_cds = $feature_with_remote_locations->spliced_seq($db);
> 
> 
> 
> 
> Jason Stajich
> jason at bioperl.org
> http://bioperl.org/wiki/User:Jason
> http://twitter.com/hyphaltip
> 
> 
> On Tue, Apr 15, 2014 at 11:39 AM, Warren Gallin <wgallin at ualberta.ca> wrote:
> I am having a problem finding a general method of recovering the nucleotide coding sequence for a protein sequence record.
> 
> Generally tracking the CDS annotation back to the nucleotide sequence record using the accession number of the nucleotide sequence is working.
> 
> One problem arises when the underlying coding sequence is spliced from multiple nucleotide records.  Is there a general approach to automatically track down and join the different sequence fragments from different sequence entries?  An example of the problem can be seen if you start from the protein record with GI number 7715882.  It is annotated as coming from three different nucleotide records.  Is there an approach in Bioperl that will detect and download these three records and splice together the appropriate parts to get the coding sequence?
> 
> The other problem that I am having is the ongoing issue of protein records annotated as highly redundant sequences, with WP-XXXXXX accession numbers.  Has anyone found a way to retrieve the set of different nucleotide sequences that all encode a single WP-annotated protein sequence?
> 
> Any help would be appreciated,
> 
> Warren Gallin
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 




