[Bioperl-l] Re: Frameshifts in alignments ... ?
Hilmar Lapp
hlapp@gnf.org
Wed, 4 Sep 2002 09:45:40 -0700
Wouldn't we want the Transcript and Exon classes to utilize this too
for the cds() method, or is something in this mechanism that makes
it specific to alignments?
-hilmar
On Wednesday, September 4, 2002, at 08:00 AM, Aaron J Mackey wrote:
>
> Ewan wrote:
>
>> Remember that the "encoding" is as well as the bases, ie, one
>> effectively
>> has two "tracks", being
>>
>> CCCCCCCCCCCIIIIIIIIIIIIIIIIIIIIIIICCCCCGGGCCCC
>> ATGGGTGTATGTATTGTGTAAAAAGAATGTTAAGGTTGT---GTET
>>
>>
>> I am happy to get into this. I would propose the following encodings:
>>
>> C - indicates is part of a codon (do we need C = 1st base of
>> codon, D = 2nd base of codon, E = 3rd base of codon?)
>
> I like the "CDE", myself; users can supply an encoded sequence with
> just
> C's though.
>
>> I - intron (also could go for I = phase 0 intron, J = phase 1
>> intron, K = phase 2 intron)
>
> Ditto.
>
>> F - frameshift indel (ie, sequencing error)
>
> Frameshifts come in a few varieties: "forward" (i.e. we skip an extra
> nucleotide, and translate the next full codon) or "backward" (i.e.
> we're
> missing a nucleotide, so we only use the adjacent two nucleotides
> for the
> codon). For example, using '\' and '/' for forward and backwards
> frameshifts (whitespace added for clarity):
>
> ATG GCTA GCT A-C GAC TAC
> CCC C\CC CCC C/C CCC CCC
>
> leads to a CDS of:
>
> ATG GTA GCT ANC GAC TAC
>
> So, in a sense, a backward frameshift is a bit like a gap, where a
> forward
> frameshift is a bit like a one-nucleotide intron. Regardless, I
> need two
> symbols, F and B maybe? Just nod if this works for you.
>
>> G - gap, implicitly protein coding
>>
>> H - gap, in "other" dna sequence
>
> These are protein indel parallels to F and B, right? i.e.:
>
> ATGGCT---GTATGT
> CCCCCCGGGCCCHHH
>
> would this then include the last codon TGT in the CDS? If so, how does
> HHH differ from CCC, in terms of influencing the encoded protein
> sequence?
> If it's only in terms of expressing the presence of a gap in another
> sequence, then I think that's better handled in that other sequence
> (together in a SimpleAlign or similar object). This is still a bit
> fuzzy
> for me ...
>
>> U - UTR (or U - 5'UTR, V - 3'UTR)
>
> Ditto as above for CDE and IJK; U and V will be used by us, but the
> user
> can just say U (and if we know strand, we can work out what's what).
>
> OK, then here is my proposed functionality; please read carefully
> as this
> is exactly what you'll get, unless someone raises a little noise soon:
>
>
> package Bio::EncodedSeq;
>
> use strict;
> use Bio::LocatableSeq;
>
> @ISA = qw(Bio::LocatableSeq);
>
> =head2 new
> Title : new
> Usage : $obj = Bio::EncodedSeq->new(-dnaseq => "AGTACGTGTCATG",
> -encoding => "CCCCCCFCCCCCC",
> -id => "myseq",
> -start => 1,
> -end => 13,
> -strand => 1
> );
> Function: creates a new Bio::EncodedSeq object from a supplied DNA
> sequence
> Returns : a new Bio::EncodedSeq object
> Args : dnaseq - primary nucleotide sequence used to encode the
> protein
> encoding - a string of characters (see Encoding Table)
> describing backwards frameshifts implied by the
> encoding but not present in the sequence will be
> added (as '-'s) to the sequence. If not
> supplied, it will be assumed that all positions
> are coding (C). Encoding may include either
> implicit phase encoding characters (i.e. "CCC")
> and/or explicit encoding characters (i.e. "CDE").
> Alternatively, encoding may be a hashref
> datastructure, with encoding characters as keys
> and Bio::LocationI objects (or arrayrefs of
> Bio::LocationI objects) as values, e.g.:
> { C => [ Bio::Location::Simple->new(1,9),
> Bio::Location::Simple->new(11,13) ],
> F => Bio::Location::Simple->new(10,10),
> } # same as "CCCCCCCCCFCCC"
> id, start, end, strand - as with Bio::LocatableSeq; note
> that the coordinates are relative to the
> encoding DNA sequence, not the implicit protein
> sequence.
> =cut
>
> =head2 encoding
> Title : encoding
> Usage : $obj->encoding("CCCCCC");
> $obj->encoding( -encoding => { I => $location } );
> $enc = $obj->encoding(-explicit => 1);
> $enc = $obj->encoding("CCCCCC", -explicit => 1);
> $enc = $obj->encoding(-location => $location,
> -explicit => 1 );
> Function: get/set the objects encoding, either globally or by
> location(s).
> Returns : the (possibly new) encoding string.
> Args : encoding - see the encoding argument to the new() function.
> explicit - whether or not to return explicit phase
> information in the coding (i.e. "CCC" becomes
> "CDE", "III" becomes "IJK", etc); defaults to 0.
> location - optional; location to get/set the encoding.
> Defaults to the entire sequence.
> =cut
>
> =head2 cds
> Title : cds
> Usage : $cds = $obj->cds();
> Function: obtain the "spliced" DNA sequence, by removing any
> nucleotides that participate in an UTR, forward frameshift
> or intron, and replacing any unknown nucleotide implied by
> a backward frameshift or gap with N's.
> Returns : a Bio::EncodedSeq object, with an encoding consisting only
> of "CCCC..".
> Args : none.
> =cut
>
> =head2 translate
> Title : translate
> Usage : $prot = $obj->translate(@args);
> Function: obtain the protein sequence encoded by the underlying DNA
> sequence; same as $obj->cds()->translate(@args).
> Returns : a Bio::PrimarySeq object.
> Args : same as the translate() function of Bio::PrimarySeqI
> =cut
>
> =head2 seq
> Title : seq
> Usage : $protseq = $obj->seq();
> Function: obtain the raw protein sequence encoded by the underlying
> DNA sequence; This is the same as calling
> $obj->translate()->seq();
> Returns : a string of single-letter amino acid codes
> Args : same as the seq() function of Bio::PrimarySeq; note that
> this
> function may not be used to set the protein sequence; see
> the dnaseq() function for that.
> =cut
>
> =head2 dnaseq
> Title : dnaseq
> Usage : $dnaseq = $obj->dnaseq();
> $obj->dnaseq("ACGTGTCGT", "CCCCCCCCC");
> $obj->dnaseq(-dnaseq => "ATG",
> -encoding => "CCC",
> -location => $loc );
> Function: get/set the underlying DNA sequence; will overwrite any
> current DNA and/or encoding information present.
> Returns : a string of single-letter nucleotide codes, including any
> gaps implied by the encoding.
> Args : dnaseq - the DNA sequence to be used as a replacement
> encoding - the encoding of the DNA sequence (see the new()
> constructor); defaults to all 'C'.
> location - optional, the location of the DNA sequence to
> get/set; defaults to the entire sequence.
> =cut
>
> [ and all the inherited Bio::LocatableSeq and Bio::PrimarySeqI
> methods; note that the coordinates of those methods will refer only to
> the underlying DNA sequence, not the implicit encoded protein sequence
> - my next task will be to extend Ewan and Heikki's Bio::Coordinate
> system to include Bio::Coordinate::EncodedPair so that conversions can
> be made more easily ... any comments on that? ]
>
> thanks for reading,
>
> -Aaron
>
> --
> Aaron J Mackey
> Pearson Laboratory
> University of Virginia
> (434) 924-2821
> amackey@virginia.edu
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------