[Bioperl-l] Re: Frameshifts in alignments ... ?
Jason Stajich
jason@cgt.mc.duke.edu
Thu, 5 Sep 2002 03:05:19 -0400 (EDT)
On Thu, 5 Sep 2002, Ewan Birney wrote:
> On Wed, 4 Sep 2002, Aaron J Mackey wrote:
>
> >
> > package Bio::EncodedSeq;
>
> I think we should go for Bio::Seq::EncodedSeq
Ditto here, and ditto about being brave below!
>
>
> >
> > use strict;
> > use vars qw(@ISA);   # @ISA must be declared under "use strict"
> > use Bio::LocatableSeq;
> >
> > @ISA = qw(Bio::LocatableSeq);
> >
> > =head2 new
> >
> >  Title   : new
> >  Usage   : $obj = Bio::EncodedSeq->new(-dnaseq   => "AGTACGTGTCATG",
> >                                        -encoding => "CCCCCCFCCCCCC",
> >                                        -id       => "myseq",
> >                                        -start    => 1,
> >                                        -end      => 13,
> >                                        -strand   => 1
> >                                       );
> >  Function: creates a new Bio::EncodedSeq object from a supplied DNA
> >            sequence
> >  Returns : a new Bio::EncodedSeq object
> >  Args    : dnaseq   - primary nucleotide sequence used to encode the
> >                       protein
> >            encoding - a string of characters (see Encoding Table)
> >                       describing the DNA sequence.  Backwards
> >                       frameshifts implied by the encoding but not
> >                       present in the sequence will be added (as
> >                       '-'s) to the sequence.  If not supplied, it
> >                       will be assumed that all positions are coding
> >                       (C).  Encoding may include implicit phase
> >                       encoding characters (e.g. "CCC") and/or
> >                       explicit encoding characters (e.g. "CDE").
> >                       Alternatively, encoding may be a hashref
> >                       datastructure, with encoding characters as
> >                       keys and Bio::LocationI objects (or arrayrefs
> >                       of Bio::LocationI objects) as values, e.g.:
> >
> >                       { C => [ Bio::Location::Simple->new(1,9),
> >                                Bio::Location::Simple->new(11,13) ],
> >                         F => Bio::Location::Simple->new(10,10),
> >                       } # same as "CCCCCCCCCFCCC"
> >
> >            id, start, end, strand - as with Bio::LocatableSeq; note
> >                       that the coordinates are relative to the
> >                       encoding DNA sequence, not the implicit
> >                       protein sequence.
> >
> > =cut
> >
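A minimal sketch of how the hashref form of -encoding described above could be flattened into the equivalent string. This is only an illustration under stated assumptions: ranges_to_encoding() is a hypothetical helper, not part of the proposed API, and plain [start, end] array refs stand in for Bio::LocationI objects so that it runs without BioPerl installed.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the proposal): expand a hash of
# encoding character => [start, end] ranges (1-based, inclusive)
# into a flat encoding string of the given length.
sub ranges_to_encoding {
    my ($length, $ranges) = @_;
    my @enc = ('C') x $length;    # unspecified positions default to coding
    for my $char (sort keys %$ranges) {
        my $spec = $ranges->{$char};
        # accept a single [start, end] pair or an array ref of pairs
        my @pairs = ref($spec->[0]) ? @$spec : ($spec);
        for my $pair (@pairs) {
            my ($start, $end) = @$pair;
            $enc[$_ - 1] = $char for $start .. $end;
        }
    }
    return join '', @enc;
}

my $encoding = ranges_to_encoding(13, {
    C => [ [1, 9], [11, 13] ],
    F => [10, 10],
});
print "$encoding\n";    # CCCCCCCCCFCCC
```

The result matches the "same as" comment in the quoted POD; a real implementation would presumably read start() and end() from the location objects instead.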
> > =head2 encoding
> >
> >  Title   : encoding
> >  Usage   : $obj->encoding("CCCCCC");
> >            $obj->encoding( -encoding => { I => $location } );
> >            $enc = $obj->encoding(-explicit => 1);
> >            $enc = $obj->encoding("CCCCCC", -explicit => 1);
> >            $enc = $obj->encoding(-location => $location,
> >                                  -explicit => 1 );
> >  Function: get/set the object's encoding, either globally or by
> >            location(s).
> >  Returns : the (possibly new) encoding string.
> >  Args    : encoding - see the encoding argument to the new() function.
> >            explicit - whether or not to return explicit phase
> >                       information in the coding (e.g. "CCC" becomes
> >                       "CDE", "III" becomes "IJK", etc); defaults to 0.
> >            location - optional; location to get/set the encoding.
> >                       Defaults to the entire sequence.
> >
> > =cut
> >
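A rough sketch of what -explicit => 1 might do to an implicit encoding string, going only by the two examples given ("CCC" becomes "CDE", "III" becomes "IJK"). The assumptions here are mine: only C and I carry phase, each letter keeps its own running phase counter across the whole string, and every other character passes through unchanged. explicit_phase() is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite implicit phase characters as explicit ones.
# Assumes C -> C/D/E and I -> I/J/K, cycling every three occurrences.
sub explicit_phase {
    my ($implicit) = @_;
    my %seen;                 # occurrences of each phased letter so far
    my $explicit = '';
    for my $char (split //, $implicit) {
        if ($char eq 'C' or $char eq 'I') {
            my $phase = $seen{$char}++ % 3;          # 0, 1, 2, 0, ...
            $explicit .= chr(ord($char) + $phase);   # e.g. C, D, E
        }
        else {
            $explicit .= $char;                      # no phase: copy as-is
        }
    }
    return $explicit;
}

print explicit_phase("CCC"), "\n";    # CDE
print explicit_phase("III"), "\n";    # IJK
```

Whether the phase should restart after a frameshift or at each new intron is not stated in the proposal, so this sketch does not try to settle it.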
> > =head2 cds
> >
> >  Title   : cds
> >  Usage   : $cds = $obj->cds();
> >  Function: obtain the "spliced" DNA sequence, by removing any
> >            nucleotides that participate in a UTR, forward frameshift
> >            or intron, and replacing any unknown nucleotide implied by
> >            a backward frameshift or gap with N's.
> >  Returns : a Bio::EncodedSeq object, with an encoding consisting only
> >            of "CCCC..".
> >  Args    : none.
> >
> > =cut
> >
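The splicing rule for cds() can be sketched on plain strings as follows. The Encoding Table itself is not in this message, so the character set is an assumption: C coding, U UTR, I intron, F forward frameshift, B backward frameshift, G gap. splice_cds() is a hypothetical stand-in that returns a string, not a Bio::EncodedSeq object.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the cds() rule to a DNA string and an
# implicit encoding string of the same length (assumed character set).
sub splice_cds {
    my ($dna, $encoding) = @_;
    die "sequence and encoding differ in length\n"
        unless length($dna) == length($encoding);
    my $cds = '';
    for my $i (0 .. length($dna) - 1) {
        my $base = substr($dna, $i, 1);
        my $char = substr($encoding, $i, 1);
        if    ($char eq 'C')       { $cds .= $base }   # coding: keep
        elsif ($char =~ /^[BG]$/)  { $cds .= 'N' }     # unknown base: mask
        elsif ($char =~ /^[UIF]$/) { next }            # not in CDS: drop
        else { die "unknown encoding character '$char'\n" }
    }
    return $cds;
}

# The constructor example: the F at position 7 is dropped.
print splice_cds("AGTACGTGTCATG", "CCCCCCFCCCCCC"), "\n";    # AGTACGGTCATG
```

With the forward frameshift removed the example is 12 nucleotides long, i.e. four whole codons, which is what translate() would then work from.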
> > =head2 translate
> >
> >  Title   : translate
> >  Usage   : $prot = $obj->translate(@args);
> >  Function: obtain the protein sequence encoded by the underlying DNA
> >            sequence; same as $obj->cds()->translate(@args).
> >  Returns : a Bio::PrimarySeq object.
> >  Args    : same as the translate() function of Bio::PrimarySeqI
> >
> > =cut
> >
> > =head2 seq
> >
> >  Title   : seq
> >  Usage   : $protseq = $obj->seq();
> >  Function: obtain the raw protein sequence encoded by the underlying
> >            DNA sequence; this is the same as calling
> >            $obj->translate()->seq();
> >  Returns : a string of single-letter amino acid codes
> >  Args    : same as the seq() function of Bio::PrimarySeq; note that
> >            this function may not be used to set the protein sequence;
> >            see the dnaseq() function for that.
> >
> > =cut
> >
> > =head2 dnaseq
> >
> >  Title   : dnaseq
> >  Usage   : $dnaseq = $obj->dnaseq();
> >            $obj->dnaseq("ACGTGTCGT", "CCCCCCCCC");
> >            $obj->dnaseq(-dnaseq   => "ATG",
> >                         -encoding => "CCC",
> >                         -location => $loc );
> >  Function: get/set the underlying DNA sequence; will overwrite any
> >            current DNA and/or encoding information present.
> >  Returns : a string of single-letter nucleotide codes, including any
> >            gaps implied by the encoding.
> >  Args    : dnaseq   - the DNA sequence to be used as a replacement
> >            encoding - the encoding of the DNA sequence (see the new()
> >                       constructor); defaults to all 'C'.
> >            location - optional, the location of the DNA sequence to
> >                       get/set; defaults to the entire sequence.
> >
> > =cut
> >
> > [ and all the inherited Bio::LocatableSeq and Bio::PrimarySeqI
> > methods; note that the coordinates of those methods will refer only to
> > the underlying DNA sequence, not the implicit encoded protein sequence
> > - my next task will be to extend Ewan and Heikki's Bio::Coordinate
> > system to include Bio::Coordinate::EncodedPair so that conversions can
> > be made more easily ... any comments on that? ]
>
>
> You are a brave man. Look forward to seeing this in...
>
>
>
> >
> > thanks for reading,
> >
> > -Aaron
> >
> > --
> > Aaron J Mackey
> > Pearson Laboratory
> > University of Virginia
> > (434) 924-2821
> > amackey@virginia.edu
> >
> >
> >
> >
>
> -----------------------------------------------------------------
> Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
> <birney@ebi.ac.uk>.
> -----------------------------------------------------------------
>
--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu