[Bioperl-l] Re: Frameshifts in alignments ... ?

Hilmar Lapp hlapp@gnf.org
Wed, 4 Sep 2002 09:45:40 -0700


Wouldn't we want the Transcript and Exon classes to utilize this too 
for the cds() method, or is something in this mechanism that makes 
it specific to alignments?

	-hilmar

On Wednesday, September 4, 2002, at 08:00  AM, Aaron J Mackey wrote:

>
> Ewan wrote:
>
>> Remember that the "encoding" is as well as the bases, ie, one 
>> effectively
>> has two "tracks", being
>>
>>   CCCCCCCCCCCIIIIIIIIIIIIIIIIIIIIIIICCCCCGGGCCCC
>>   ATGGGTGTATGTATTGTGTAAAAAGAATGTTAAGGTTGT---GTET
>>
>>
>> I am happy to get into this. I would propose the following encodings:
>>
>>  C - indicates is part of a codon (do we need C = 1st base of 
>> codon, D = 2nd base of codon, E = 3rd base of codon?)
>
> I like the "CDE", myself; users can supply an encoded sequence with 
> just
> C's though.
>
>>  I - intron (also could go for I = phase 0 intron, J = phase 1 
>> intron, K = phase 2 intron)
>
> Ditto.
>
>>  F - frameshift indel (ie, sequencing error)
>
> Frameshifts come in a few varieties: "forward" (i.e. we skip an extra
> nucleotide, and translate the next full codon) or "backward" (i.e. 
> we're
> missing a nucleotide, so we only use the adjacent two nucleotides 
> for the
> codon).  For example, using '\' and '/' for forward and backwards
> frameshifts (whitespace added for clarity):
>
> ATG GCTA GCT A-C GAC TAC
> CCC C\CC CCC C/C CCC CCC
>
> leads to a CDS of:
>
> ATG GTA  GCT ANC GAC TAC
>
> So, in a sense, a backward frameshift is a bit like a gap, where a 
> forward
> frameshift is a bit like a one-nucleotide intron.  Regardless, I 
> need two
> symbols, F and B maybe?  Just nod if this works for you.
>
>>  G - gap, implicitly protein coding
>>
>>  H - gap, in "other" dna sequence
>
> These are protein indel parallels to F and B, right?  i.e.:
>
> ATGGCT---GTATGT
> CCCCCCGGGCCCHHH
>
> would this then include the last codon TGT in the CDS?  If so, how does
> HHH differ from CCC, in terms of influencing the encoded protein 
> sequence?
> If it's only in terms of expressing the presence of a gap in another
> sequence, then I think that's better handled in that other sequence
> (together in a SimpleAlign or similar object).  This is still a bit 
> fuzzy
> for me ...
>
>>  U - UTR (or U - 5'UTR, V - 3'UTR)
>
> Ditto as above for CDE and IJK; U and V will be used by us, but the 
> user
> can just say U (and if we know strand, we can work out what's what).
>
> OK, then here is my proposed functionality; please read carefully 
> as this
> is exactly what you'll get, unless someone raises a little noise soon:
>
>
> package Bio::EncodedSeq;
>
> use strict;
> use Bio::LocatableSeq;
>
> @ISA = qw(Bio::LocatableSeq);
>
> =head2 new
>  Title   : new
>  Usage   : $obj = Bio::EncodedSeq->new(-dnaseq   => "AGTACGTGTCATG",
>                                        -encoding => "CCCCCCFCCCCCC",
>                                        -id       => "myseq",
>                                        -start    => 1,
>                                        -end      => 13,
>                                        -strand   => 1
>                                       );
>  Function: creates a new Bio::EncodedSeq object from a supplied DNA
>            sequence
>  Returns : a new Bio::EncodedSeq object
>  Args    : dnaseq   - primary nucleotide sequence used to encode the
>                       protein
>            encoding - a string of characters (see Encoding Table)
>                       describing backwards frameshifts implied by the
>                       encoding but not present in the sequence will be
>                       added (as '-'s) to the sequence.  If not
>                       supplied, it will be assumed that all positions
>                       are coding (C).  Encoding may include either
>                       implicit phase encoding characters (i.e. "CCC")
>                       and/or explicit encoding characters (i.e. "CDE").
>                       Alternatively, encoding may be a hashref
>                       datastructure, with encoding characters as keys
>                       and Bio::LocationI objects (or arrayrefs of
>                       Bio::LocationI objects) as values, e.g.:
>                       { C => [ Bio::Location::Simple->new(1,9),
>                                Bio::Location::Simple->new(11,13) ],
>                         F => Bio::Location::Simple->new(10,10),
>                       } # same as "CCCCCCCCCFCCC"
>            id, start, end, strand - as with Bio::LocatableSeq; note
>                       that the coordinates are relative to the
>                       encoding DNA sequence, not the implicit protein
>                       sequence.
> =cut
>
> =head2 encoding
>  Title   : encoding
>  Usage   : $obj->encoding("CCCCCC");
>            $obj->encoding( -encoding => { I => $location } );
>            $enc = $obj->encoding(-explicit => 1);
>            $enc = $obj->encoding("CCCCCC", -explicit => 1);
>            $enc = $obj->encoding(-location => $location,
>                                  -explicit => 1 );
>  Function: get/set the objects encoding, either globally or by 
> location(s).
>  Returns : the (possibly new) encoding string.
>  Args    : encoding - see the encoding argument to the new() function.
>            explicit - whether or not to return explicit phase
>                       information in the coding (i.e. "CCC" becomes
>                       "CDE", "III" becomes "IJK", etc); defaults to 0.
>            location - optional; location to get/set the encoding.
>                       Defaults to the entire sequence.
> =cut
>
> =head2 cds
>  Title   : cds
>  Usage   : $cds = $obj->cds();
>  Function: obtain the "spliced" DNA sequence, by removing any
>            nucleotides that participate in an UTR, forward frameshift
>            or intron, and replacing any unknown nucleotide implied by
>            a backward frameshift or gap with N's.
>  Returns : a Bio::EncodedSeq object, with an encoding consisting only
>            of "CCCC..".
>  Args    : none.
> =cut
>
> =head2 translate
>  Title   : translate
>  Usage   : $prot = $obj->translate(@args);
>  Function: obtain the protein sequence encoded by the underlying DNA
>            sequence; same as $obj->cds()->translate(@args).
>  Returns : a Bio::PrimarySeq object.
>  Args    : same as the translate() function of Bio::PrimarySeqI
> =cut
>
> =head2 seq
>  Title   : seq
>  Usage   : $protseq = $obj->seq();
>  Function: obtain the raw protein sequence encoded by the underlying
>            DNA sequence; This is the same as calling
>            $obj->translate()->seq();
>  Returns : a string of single-letter amino acid codes
>  Args :    same as the seq() function of Bio::PrimarySeq; note that 
> this
>            function may not be used to set the protein sequence; see
>            the dnaseq() function for that.
> =cut
>
> =head2 dnaseq
>  Title   : dnaseq
>  Usage   : $dnaseq = $obj->dnaseq();
>            $obj->dnaseq("ACGTGTCGT", "CCCCCCCCC");
>            $obj->dnaseq(-dnaseq => "ATG",
>                         -encoding => "CCC",
>                         -location => $loc );
>  Function: get/set the underlying DNA sequence; will overwrite any
>            current DNA and/or encoding information present.
>  Returns : a string of single-letter nucleotide codes, including any
>            gaps implied by the encoding.
>  Args    : dnaseq   - the DNA sequence to be used as a replacement
>            encoding - the encoding of the DNA sequence (see the new()
>                       constructor); defaults to all 'C'.
>            location - optional, the location of the DNA sequence to
>                       get/set; defaults to the entire sequence.
> =cut
>
> [ and all the inherited Bio::LocatableSeq and Bio::PrimarySeqI
> methods; note that the coordinates of those methods will refer only to
> the underlying DNA sequence, not the implicit encoded protein sequence
> - my next task will be to extend Ewan and Heikki's Bio::Coordinate
> system to include Bio::Coordinate::EncodedPair so that conversions can
> be made more easily ... any comments on that? ]
>
> thanks for reading,
>
> -Aaron
>
> --
>  Aaron J Mackey
>  Pearson Laboratory
>  University of Virginia
>  (434) 924-2821
>  amackey@virginia.edu
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>
--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------