[Bioperl-l] Re: Frameshifts in alignments ... ?

Wed, 4 Sep 2002 11:00:13 -0400 (EDT)

Ewan wrote:

> Remember that the "encoding" is as well as the bases, ie, one effectively
> has two "tracks", being
>
>   CCCCCCCCCCCIIIIIIIIIIIIIIIIIIIIIIICCCCCGGGCCCC
>   ATGGGTGTATGTATTGTGTAAAAAGAATGTTAAGGTTGT---GTET
>
>
> I am happy to get into this. I would propose the following encodings:
>
>  C - indicates is part of a codon (do we need C = 1st base of codon, D = 2nd base of codon, E = 3rd base of codon?)

I like the "CDE", myself; users can supply an encoded sequence with just
C's though.

>  I - intron (also could go for I = phase 0 intron, J = phase 1 intron, K = phase 2 intron)

Ditto.

>  F - frameshift indel (ie, sequencing error)

Frameshifts come in a few varieties: "forward" (i.e. we skip an extra
nucleotide, and translate the next full codon) or "backward" (i.e. we're
missing a nucleotide, so we only use the adjacent two nucleotides for the
codon).  For example, using '\' and '/' for forward and backwards
frameshifts (whitespace added for clarity):

ATG GCTA GCT A-C GAC TAC
CCC C\CC CCC C/C CCC CCC

leads to a CDS of:

ATG GTA  GCT ANC GAC TAC

So, in a sense, a backward frameshift is a bit like a gap, where a forward
frameshift is a bit like a one-nucleotide intron.  Regardless, I need two
symbols, F and B maybe?  Just nod if this works for you.

>  G - gap, implicitly protein coding
>
>  H - gap, in "other" dna sequence

These are protein indel parallels to F and B, right?  i.e.:

ATGGCT---GTATGT
CCCCCCGGGCCCHHH

would this then include the last codon TGT in the CDS?  If so, how does
HHH differ from CCC, in terms of influencing the encoded protein sequence?
If it's only in terms of expressing the presence of a gap in another
sequence, then I think that's better handled in that other sequence
(together in a SimpleAlign or similar object).  This is still a bit fuzzy
for me ...

>  U - UTR (or U - 5'UTR, V - 3'UTR)

Ditto as above for CDE and IJK; U and V will be used by us, but the user
can just say U (and if we know strand, we can work out what's what).

OK, then here is my proposed functionality; please read carefully as this
is exactly what you'll get, unless someone raises a little noise soon:

package Bio::EncodedSeq;

use strict;
use Bio::LocatableSeq;

@ISA = qw(Bio::LocatableSeq);

=head2 new
 Title   : new
 Usage   : $obj = Bio::EncodedSeq->new(-dnaseq   => "AGTACGTGTCATG",
                                       -encoding => "CCCCCCFCCCCCC",
                                       -id       => "myseq",
                                       -start    => 1,
                                       -end      => 13,
                                       -strand   => 1
                                      );
 Function: creates a new Bio::EncodedSeq object from a supplied DNA
           sequence
 Returns : a new Bio::EncodedSeq object
 Args    : dnaseq   - primary nucleotide sequence used to encode the
                      protein
           encoding - a string of characters (see Encoding Table)
                      describing backwards frameshifts implied by the
                      encoding but not present in the sequence will be
                      added (as '-'s) to the sequence.  If not
                      supplied, it will be assumed that all positions
                      are coding (C).  Encoding may include either
                      implicit phase encoding characters (i.e. "CCC")
                      and/or explicit encoding characters (i.e. "CDE").
                      Alternatively, encoding may be a hashref
                      datastructure, with encoding characters as keys
                      and Bio::LocationI objects (or arrayrefs of
                      Bio::LocationI objects) as values, e.g.:
                      { C => [ Bio::Location::Simple->new(1,9),
                               Bio::Location::Simple->new(11,13) ],
                        F => Bio::Location::Simple->new(10,10),
                      } # same as "CCCCCCCCCFCCC"
           id, start, end, strand - as with Bio::LocatableSeq; note
                      that the coordinates are relative to the
                      encoding DNA sequence, not the implicit protein
                      sequence.
=cut

=head2 encoding
 Title   : encoding
 Usage   : $obj->encoding("CCCCCC");
           $obj->encoding( -encoding => { I => $location } );
           $enc = $obj->encoding(-explicit => 1);
           $enc = $obj->encoding("CCCCCC", -explicit => 1);
           $enc = $obj->encoding(-location => $location,
                                 -explicit => 1 );
 Function: get/set the objects encoding, either globally or by location(s).
 Returns : the (possibly new) encoding string.
 Args    : encoding - see the encoding argument to the new() function.
           explicit - whether or not to return explicit phase
                      information in the coding (i.e. "CCC" becomes
                      "CDE", "III" becomes "IJK", etc); defaults to 0.
           location - optional; location to get/set the encoding.
                      Defaults to the entire sequence.
=cut

=head2 cds
 Title   : cds
 Usage   : $cds = $obj->cds();
 Function: obtain the "spliced" DNA sequence, by removing any
           nucleotides that participate in an UTR, forward frameshift
           or intron, and replacing any unknown nucleotide implied by
           a backward frameshift or gap with N's.
 Returns : a Bio::EncodedSeq object, with an encoding consisting only
           of "CCCC..".
 Args    : none.
=cut

=head2 translate
 Title   : translate
 Usage   : $prot = $obj->translate(@args);
 Function: obtain the protein sequence encoded by the underlying DNA
           sequence; same as $obj->cds()->translate(@args).
 Returns : a Bio::PrimarySeq object.
 Args    : same as the translate() function of Bio::PrimarySeqI
=cut

=head2 seq
 Title   : seq
 Usage   : $protseq = $obj->seq();
 Function: obtain the raw protein sequence encoded by the underlying
           DNA sequence; This is the same as calling
           $obj->translate()->seq();
 Returns : a string of single-letter amino acid codes
 Args :    same as the seq() function of Bio::PrimarySeq; note that this
           function may not be used to set the protein sequence; see
           the dnaseq() function for that.
=cut

=head2 dnaseq
 Title   : dnaseq
 Usage   : $dnaseq = $obj->dnaseq();
           $obj->dnaseq("ACGTGTCGT", "CCCCCCCCC");
           $obj->dnaseq(-dnaseq => "ATG",
                        -encoding => "CCC",
                        -location => $loc );
 Function: get/set the underlying DNA sequence; will overwrite any
           current DNA and/or encoding information present.
 Returns : a string of single-letter nucleotide codes, including any
           gaps implied by the encoding.
 Args    : dnaseq   - the DNA sequence to be used as a replacement
           encoding - the encoding of the DNA sequence (see the new()
                      constructor); defaults to all 'C'.
           location - optional, the location of the DNA sequence to
                      get/set; defaults to the entire sequence.
=cut

[ and all the inherited Bio::LocatableSeq and Bio::PrimarySeqI
methods; note that the coordinates of those methods will refer only to
the underlying DNA sequence, not the implicit encoded protein sequence
- my next task will be to extend Ewan and Heikki's Bio::Coordinate
system to include Bio::Coordinate::EncodedPair so that conversions can
be made more easily ... any comments on that? ]

thanks for reading,

-Aaron

-- 
 Aaron J Mackey
 Pearson Laboratory
 University of Virginia
 (434) 924-2821
 amackey@virginia.edu