[Bioperl-l] Trans-Splicing Coding Sequences

Sun Aug 22 03:46:53 EDT 2004

The problem is that spliced_seq sorts the sub-locations before joining
them (at around line 697 in Bio::SeqFeatureI), so all you need to do is
remove the sorting. If you want, add an optional 3rd argument to this
method which, when true, skips sorting the sub-locations. Make sense?
If it works, post a patch and we'll incorporate it.
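
To illustrate why the sorting matters, here is a minimal sketch in plain Perl
(no BioPerl calls). It is not the spliced_seq implementation: the helper names
revcomp and splice_in_order, the toy sequence, and the coordinates are all
made up for the example. It only shows that joining segments in the order
written gives a different result from joining them after sorting by start.

```perl
use strict;
use warnings;

# Reverse-complement a DNA string (hypothetical helper for this sketch).
sub revcomp {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Join 1-based, inclusive [start, end] segments in exactly the order given,
# then reverse-complement the whole join if $complement is true.
# This mirrors how complement(join(...)) is read; no sorting happens here.
sub splice_in_order {
    my ($genome, $segments, $complement) = @_;
    my $joined = join '',
        map { substr($genome, $_->[0] - 1, $_->[1] - $_->[0] + 1) } @$segments;
    return $complement ? revcomp($joined) : $joined;
}

my $genome   = 'AAAACCCCGGGGTTTT';    # toy sequence, not real data
my @segments = ([9, 12], [1, 4]);     # written order: higher coordinates first

# Order as written in the feature table.
my $as_written = splice_in_order($genome, \@segments, 0);

# What you get if the segments are sorted by start before joining.
my @sorted        = sort { $a->[0] <=> $b->[0] } @segments;
my $after_sorting = splice_in_order($genome, \@sorted, 0);

# Same written order, treated as complement(join(...)).
my $as_written_rc = splice_in_order($genome, \@segments, 1);

print "as written:      $as_written\n";
print "after sorting:   $after_sorting\n";
print "as written (rc): $as_written_rc\n";
```

For a trans-spliced CDS the two orders differ, so the sorted join translates
to garbage, which would be consistent with the output quoted below.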

-jason

On Fri, 20 Aug 2004, James Thompson wrote:

> Dear Bioperlers,
>
> I'm currently working on a project involving some trans-spliced coding sequences
> from Arabidopsis, and I was wondering if BioPerl provided an easy way of taking a
> trans-spliced CDS feature and correctly splicing it into a Bio::Seq object.
>
> Here's my naive stab at doing this using the Bioperl methods that I know:
>
> use strict;
> use Bio::SeqIO;
>
> my $seqio = Bio::SeqIO->new( -file => 'NC_001284.gbk', -format => 'genbank' );
> my $genome = $seqio->next_seq;
>
> foreach my $cds (grep { $_->primary_tag =~ /CDS/i } $genome->get_SeqFeatures) {
>    print $cds->start, " -> ", $cds->end, "\n";
>    print $cds->spliced_seq->translate->seq, "\n";
>
>    last;
> }
>
> This just tries to use the spliced_seq method to splice the CDS into a sequence,
> and here's the output:
>
> 79740 -> 333105
> LFHDLWVYWSYPLRSISQDFDRIRNHWCSI*WYFYGDSVYRCRIPIQDHCSSFSYVGTRYL*GFTHPGYSIPFYCA*NLYFC
> *YFTCFYLWFLWSYIATNLLFLQHCFYDLRSTGRHGPNESQKTSSS*FNWTCRLYSYWFLMWNHRRNSITTNWYLYLCINDD
> GCIRHSFSITANPCQIYSGFGRSSQNESYFGYYLLHYYVLIRRNTPVSRLL*QILFVLRRFGLWGLLSSPSGSSD*RYRSFL
> LYTLSEKNVF*YT*DMDSI*TNGS**VVTTSNDFLFHYFILAIPLSFVLSYSSNGTQFISLNESRIRSDPPTHVQSFFSGFP
> RDLYH*CNLHFAHSWSCI*YL*EI*LSAVSQ*CGLAWIT*CSNNLASARRWRTSPNYCPFILE*SF*EGQFYIFLPNLSIIK
> YGWYHFDVFRFFRPREV*CF*IHCINSTSYSRYALYDLGS*FNCHVFSY*ASKFMFLCNRSIKKKV*IFHGSRLEIFDLRCI
> FLWNIIVW
>
> The translated sequence for this coding sequence should look like this:
>
> MKAEFVRILPHMFNLFLAVSPEIFIINATSILLIHGVVFSTSKK
> YDYPPLASNVGWLGLLSVLITLLLLAAGAPLLTIAHLFWNNLFRRDNFTYFCQIFLLL
> STAGTISMCFDSSDQERFDAFEFIVLIPLPTRGMLFMISAHDLIAMYLAIEPQSLCFY
> VIAASKRKSEFSTEAGSKYLILGAFSSGILLFGCSMIYGSTGATHFDQLAKILTGYEI
> TGARSSGIFMGILSIAVGFLFKITAVPFHMWAPDIYEGSPTPVTAFLSIAPKISISAN
> ILRVSIYGSYGATLQQIFFFCSIASMILGALAAMAQTKVKRPLAHSSIGHVGYIRTGF
> SCGTIEGIQSLLIGIFIYALMTMDAFAIVSALRQTRVKYIADLGALAKTNPISAITFS
> ITMFSYAGIPPLAGFCSKFYLFFAALGCGAYFLAPVGVVTSVIGRFYYIRLVKRMFFD
> TPRTWILYEPMDRNKSLLLAMTSFFITSSLLYPSPLFSVTHQMALSSYL
>
> I'm guessing that the spliced_seq method in Bio::SeqFeatureI isn't correctly
> recognizing the Location definition for this coding sequence, which looks like this:
>
> CDS             complement(join(327890..328078,329735..330306,
>                 332945..333105,79740..80132,81113..81297))
>
> Could anyone help me shed any light on this? Ideally I'd like to translate all
> of these CDS features into individual Bio::Seq objects for further analysis,
> and I thought I'd ask for a bit of help before I wrote my own parser. Should I
> try sub-classing Bio::SeqFeature to overwrite the spliced_seq method, or has
> someone else already figured out this problem? Any suggestions would be very
> helpful.
>
> Thanks,
>
> James Thompson
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu