[Bioperl-l] Using frame info from GFF in getting aSeq->spliced_seq

Chris Fields cjfields at uiuc.edu
Fri Dec 8 20:03:16 UTC 2006


> Yes, I think. Scott Cain pointed out that GFF column 8 is the 
> "phase", which I had never heard of before. My current, very 
> limited, understanding is that sometimes you'll have an exon 
> with, say, 31 bp, followed by an exon with 29 bp. When the 
> intron gets spliced out, you eventually get an mRNA of 60 bp, 
> which translates to a protein of 20 aa.
> But the second exon has a phase of 1, not 0, because you 
> can't just start translating at the first bp of the second 
> exon and expect to get nice amino acids.

I think the use of 'frame' here is meant relative to the DNA sequence (i.e.
ORF searching, 6 frames) and the 'phase' is relative to the mRNA (i.e.
translation, three frames).  At least I think that's what is meant!
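
To make the phase bookkeeping concrete, here is a minimal sketch in plain
Python (not BioPerl; the function name exon_phases is made up for
illustration). It assumes the GFF3 definition of phase: the number of bases
to skip at the start of a feature to reach the first base of the next codon.
Under that definition the 31 bp / 29 bp example gives the second exon a
phase of 2; a convention that instead counts how many bases of the split
codon sit in the previous exon would give 1.

```python
def exon_phases(lengths, first_phase=0):
    """Return the GFF3-style phase of each coding exon, in transcript order.

    lengths     -- coding lengths of the exons, 5' to 3'
    first_phase -- phase of the first exon (assumed 0 unless the CDS is partial)
    """
    phases = []
    phase = first_phase
    for length in lengths:
        phases.append(phase)
        # Bases of this exon that are read in frame once `phase` bases are skipped.
        in_frame = length - phase
        # Bases left over after the last complete codon need (3 - leftover)
        # more bases from the next exon to finish that codon.
        phase = (3 - in_frame % 3) % 3
    return phases


print(exon_phases([31, 29]))  # [0, 2]
```

The total is still 60 bp and 20 codons; the phase only says where the codon
boundaries fall relative to each exon start.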

> By the way, whether or not phase is the same thing as frame, 
> when I call the frame() method on the features created by 
> Bio::Tools::GFF, I get the phase info. I assume that's a 
> feature (no pun intended), not a bug?
> 
> I'm still confused as to why you would have a phase in the 
> first exon, though. Why not just say the CDS starts 1 or 2 bp 
> later? (This is probably a bio question, not a bioperl 
> question, but a quick Google didn't get me an answer. "Phase" 
> isn't a very good search term.)

It could be because the location coordinates delineate the exon coding boundary.
It's conceivable the first exon in a sequence record is not the first exon
of the mRNA (i.e. there may be one or more exons prior to or past the exon
of interest that are in 'remote' sequence records).  Like this admittedly
extreme example (GB acc AF130134):

join(AF130124.1:2563..2964,AF130125.1:21..157,AF130126.1:12..174,
AF130127.1:21..112,AF130128.1:21..162,AF130128.1:281..595,
AF130128.1:661..842,AF130128.1:916..1030,AF130129.1:21..115,
AF130130.1:21..165,AF130131.1:21..125,AF130132.1:21..428,
AF130132.1:492..746,AF130133.1:21..168,AF130133.1:232..401,
AF130133.1:475..906,AF130133.1:970..1107,AF130133.1:1176..1367,21..128)

Also, the ends of the location may be uncertain ('fuzzy'):

join(complement(1009..>1260),complement(AF081827.1:<1..177))
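
As a rough illustration of what these location strings encode, the sketch
below (plain Python, not the BioPerl location parser; parse_segments and the
dictionary keys are made-up names) pulls out each segment's remote accession,
coordinates, strand and fuzziness. It only handles per-segment complement()
as in the example above, not complement(join(...)).

```python
import re

SEGMENT = re.compile(
    r"(complement\()?"                   # optional per-segment complement
    r"(?:([A-Za-z_]+\d+(?:\.\d+)?):)?"   # optional remote accession, e.g. AF081827.1:
    r"([<>]?)(\d+)\.\.([<>]?)(\d+)"      # start..end, either end possibly fuzzy
)


def parse_segments(location):
    """Split a GenBank-style location string into a list of segment dicts."""
    segments = []
    for comp, acc, sfuzz, start, efuzz, end in SEGMENT.findall(location):
        segments.append({
            "accession": acc or None,        # None means the local record
            "start": int(start),
            "end": int(end),
            "strand": -1 if comp else 1,
            "fuzzy": bool(sfuzz or efuzz),   # '<' or '>' on either end
        })
    return segments


for seg in parse_segments(
        "join(complement(1009..>1260),complement(AF081827.1:<1..177))"):
    print(seg)
```

A first segment that is remote or fuzzy is exactly the case where you cannot
assume translation starts at its first base.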

> I guess the real question here, which Jason alludes to, is whether
> SeqFeature->spliced_seq ought to take into account the phase 
> information
> of the first exon. Right now, it doesn't, so when you call
> SeqFeature->spliced_seq->translate, you get gibberish. Are there cases
> where you would want spliced_seq to include the first bp or 
> two? Should there be an option to spliced_seq for whether you 
> want to take phase information into account?
> 
> I can't submit a bug report until we confirm it's a bug.
> 
> Thanks,
> -Amir Karger

You can already pass the frame or an offset to PrimarySeqI::translate().
Here are the args:

 Args    : -terminator    - character for terminator        default is *
           -unknown       - character for unknown           default is X
           -frame         - frame                           default is 0
           -codontable_id - codon table id                  default is 1
           -complete      - complete CDS expected           default is 0
           -throw         - throw exception if not complete default is 0
           -orf           - find 1st ORF                    default is 0
           -start         - alternative initiation codon
           -codontable    - Bio::Tools::CodonTable object
           -offset        - offset for fuzzy locations      default is 0

The offset comes from some GenBank seqfeatures which have a '/codon_start'
tag indicating which nucleotide to start translation from (1,2,3).  This is
essentially just the phase+1.  We could add a '-phase' argument for
convenience which accepts 0,1,2.
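
The phase / codon_start relationship can be sketched as follows. This is
plain Python, not BioPerl, and the helper names (translate,
translate_with_phase) are made up for illustration; it assumes the standard
genetic code (table 1) and a spliced sequence in A/C/G/T only. It does not
model how translate() itself handles -frame or -offset.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate from the first base; a trailing partial codon is dropped."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")   # X for anything not in the table
        for i in range(0, len(seq) - 2, 3)
    )


def translate_with_phase(spliced, phase):
    """Skip `phase` bases (0, 1 or 2), i.e. codon_start = phase + 1."""
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1 or 2")
    return translate(spliced[phase:])


spliced = "GG" + "ATGGCCAAATAA"           # first exon starts mid-codon, phase 2
print(translate(spliced))                 # GWPN  (out of frame: gibberish)
print(translate_with_phase(spliced, 2))   # MAK*  (same as codon_start = 3)
```

This is the same effect Amir describes: translating the spliced sequence
from its first base is wrong whenever the first exon's phase is not 0.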

chris



