[Bioperl-l] Possible Repeat E-Mail

Thu Apr 17 01:09:34 UTC 2014

spliced_seq() with no DB handle then is all you need to do if the location
wholely contained with the contig at hand. Your earlier email made it sound
like the spanned contigs -- but I think basically you want the coded_by
feature for the protein in order to get the pieces you need -- BTW this is
basically a problem I use for one of my tutorials so I probably have an
example solution up somewhere, I think this is it:

http://jason.open-bio.org/Bioperl_Tutorials/Duke_2004/problem_sets/Project1/

Jason Stajich
jason at bioperl.org
http://bioperl.org/wiki/User:Jason
http://twitter.com/hyphaltip

On Wed, Apr 16, 2014 at 11:38 AM, Fields, Christopher J <
cjfields at illinois.edu> wrote:

> Based on the docs:
>
> https://metacpan.org/pod/Bio::SeqFeatureI#spliced_seq
>
> spliced_seq() would only use the DB handle if you had a split location and
> if part of the location is remote (neither of which you have).  This is
> designed for features like those spanning different contigs in an assembly,
> where one portion of the sequence is on a ‘remote’ fragment, like
> 'join(AB000123:5567-5589,80..1144)’.  Note the first segment in that
> example is located on the sequence AB000123, so spliced_seq() would
> retrieve that sequence from the remote database handle, grab the fragment,
> and splice it to the local segment.
>
> The location of the feature you have is:
>
>      CDS             1..1158
>
> It’s a simple location (no split), so it simply returns what you already
> have.
>
> You could probably have a look in the guts of spliced_seq() to see what
> it’s doing; I think you could pull out something that works for you.
>
> chris
>
> On Apr 16, 2014, at 12:07 PM, Warren Gallin <wgallin at ualberta.ca> wrote:
>
> > OK, I misunderstood the purpose of passing the Bio::DB::GenBank handle
> to the spliced_sequence method - I thought that used the Accession Number
> in the CDS feature to go to the GENBANK nucleotide sequence file and grab
> the nucleic acid sequence.
> >
> >
> > So am I correct in thinking that the approach of passing a handle to
> GenBank to the spliced_sequence method only reaches through from one
> nucleotide sequence entry to another nucleotide sequence entry?  Or am I
> sill missing the point?
> >
> >
> > Warren
> >
> > On Apr 16, 2014, at 10:55 AM, Fields, Christopher J <
> cjfields at illinois.edu> wrote:
> >
> >> Warren,
> >>
> >> Simple explanation.  The sequence record you are using is a *protein*
> record; spliced_seq() works on the instance’s sequence (the protein
> sequence).
> >>
> >> You need the nucleotide sequence associated with this one, so you have
> to retrieve the nucleotide record associated with it, the one mentioned in
> the ‘CDS’ feature (the ‘coded_by’ tag) for the file you retrieved:
> >>
> >>    CDS             1..1158
> >>                    /gene="KCNH2"
> >>                    /coded_by="NM_001193658.1:14..3490"
> >>                    /db_xref="GeneID:100064000"
> >>
> >> and then grab the relevant sequence from that.
> >>
> >> chris
> >>
> >>
> >> On Apr 16, 2014, at 12:37 AM, Warren Gallin <wgallin at ualberta.ca>
> wrote:
> >>
> >>> Jason,
> >>>
> >>>     My previous message bounced, presumably because I included an
> attachment.
> >>>
> >>>     On the chance that it did not make it through, here is the
> relevant test case:
> >>>
> >>> A script called Test_Script.pl is as follows:
> >>>
> >>> ____________________________________________________________________
> >>>
> >>> #!/usr/bin/perl
> >>>
> >>> use strict;
> >>> use warnings;
> >>> use DBI;
> >>> use Bio::Seq;
> >>> use Bio::DB::EUtilities;
> >>> use Bio::SeqIO;
> >>> use Bio::Seq;
> >>> use Data::Printer;
> >>> use Bio::DB::GenBank;
> >>>
> >>> my $gi = 302393575;  #This gi number is for the protein record of a
> horse ion channel
> >>> my $spliced_cds;
> >>> my $na_seq;
> >>> my %na_vkcnt_id;
> >>>
> >>> #Create a database handle to GENBANK for retrieving coding sequences
> >>>
> >>> my $gb_db = Bio::DB::GenBank->new();
> >>>
> >>>
> >>> #create a structure for fetching protein records from GENBANK
> >>>
> >>> my $factory = Bio::DB::EUtilities->new(
> >>>  -eutil   => 'efetch',
> >>>  -db      => 'protein',
> >>>  -rettype => 'gb',
> >>>  -email   => 'wgallin at ualberta.ca',
> >>>  -id      => $gi
> >>> );
> >>>
> >>> my $holding_file = 'protein_records.gb';
> >>>
> >>> $factory->get_Response( -file => $holding_file );
> >>>
> >>> my $seqin = Bio::SeqIO->new(
> >>>  -file   => $holding_file,
> >>>  -format => 'genbank'
> >>> );
> >>>
> >>> while ( my $seq = $seqin->next_seq ) {
> >>>  my $na_acc_gennt;
> >>>  my $hit_gi = $seq->primary_id;
> >>>
> >>>  for my $feature_object ( $seq->get_SeqFeatures ) {
> >>>
> >>>      if ( $feature_object->primary_tag eq "CDS" ) {
> >>>          $spliced_cds = $feature_object->spliced_seq($gb_db);
> >>>          $na_seq      = $spliced_cds->seq;
> >>>
> >>>      print "UPDATE gennt SET cds = \"$na_seq\" ;\n";
> >>>             }
> >>>  }
> >>> }
> >>> exit;
> >>>
> >>> ________________________________________________________
> >>>
> >>> When I run it I get the following output:
> >>>
> >>>
> >>> _________________________________________________________
> >>>
> >>> warrenglinsmbp2:140414_Update_Flawed_gennt_Entries wgallin$ perl
> Test_Script.pl
> >>>
> >>> --------------------- WARNING ---------------------
> >>> MSG: API has changed; please use '-db' or '-nosort' for args. See POD
> for more details.
> >>> ---------------------------------------------------
> >>> UPDATE gennt SET cds =
> "MPVRRGHVAPQNTFLDTIIRKFEGQSRKFIIANARVENCAVIYCNDGFCELCGYSRAEVMQRPCTCDFLHGPRTQRRAAAQIAQALLGAEERKVEISFYRKDGSCFLCLVDVVPVKNEDGAVIMFILNFEVVMEKDMVGSPARDTNHRGPPTSWLATGRAKTFRLKLPALLALTARESTVRPGGAGSTGAPGAVVVDVDLTPAAPSSESLALDEVTAMDNHVAGLGPAEERRALVGPGSPPACAPIPHPSPRAHSLNPDASGSSCSLARTRSRESCASVRRASSADDIEAMRTGLPPPPRHASTGAMHPLRSGLLNSTSDSDLVRYRTISKIPQITLNFVDLKGDPFLASPTSDREIIAPKIKERTHNVTEKVTQVLSLGADVLPEYKLQAPRIHRWTILHYSPFKAVWDWLILLLVIYTAVFTPYSAAFLLKETEEGPPATDCGYACQPLAVVDLIVDIMFIVDILINFRTTYVNANEEVVSHPGRIAVHYFKGWFLIDMVAAIPFDLLIFGSGSEELIGLLKTARLLRLVRVARKLDRYSEYGAAVLFLLMCTFALIAHWLACIWYAIGNMEQPHMDSRIGWLHNLGDQIGKPYNSSGLGGPSIKDKYVTALYFTFSSLTSVGFGNVSPNTNSEKIFSICVMLIGSLMYASIFGNVSAIIQRLYSGTARYHTQMLRVREFIRFHQIPNPLRQRLEEYFQHAWSYTNGIDMNAVLKGFPECLQADICLHLNRSLLQHCKPFRGATKGCLRALAMKFKTTHAPPGDTLVHAGDLLTALYFISRGSIEILRGDVVVAILGKNDIFGEPLNLYARPGKSNGDVRALTYCDLHKIHRDDLLEVLDMYPEFSDHFWSSLEITFNLRDTNMIPGSPGSTELEGGFNRQRKRKLSFRRRTDKDPEQPGEVSALGPGRAGAGPSSRGRPGGPWGESPSSGPSSPESSEDEGPGRSSSPLRLVPFSSPRPPGE!
> >>>
> PPGGEPLIEDCEKSSDTCNPLSGAFSGVSNIFSFWGDSRGRQYQELPRCPAPAPSLLNIPLSSPGRRPRGDVESRLDALQRQLNRLETRLSADMATVLQLLQRQMTLVPPAYSAVTTPGPGPTSTSPLLPVSPIPTLTLDSLSQVSQFMACEELPPGAPELPQDGPTRRLSLPGQLGALTSQPLHRHGSDPGS"
> ;
> >>> warrenglinsmbp2:140414_Update_Flawed_gennt_Entries wgallin$
> >>>
> >>>
> >>> ___________________________________________________________
> >>>
> >>> The problem is that the sequence is being returned as translated CDS
> rather than the nucleotide sequence of the CDS.
> >>>
> >>> This happens with every gi number that I have tried.
> >>>
> >>> I fell like I am missing some subtle BioPerl of wisdom, but I can not
> figure out what that is.
> >>>
> >>> Thanks for looking at this.
> >>>
> >>> Warren Gallin
> >>>
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >
>
>