[Bioperl-l] Re: Frameshifts in alignments ... ?

Ewan Birney birney@ebi.ac.uk
Tue, 3 Sep 2002 15:50:16 +0100 (BST)


On Tue, 3 Sep 2002, Aaron J Mackey wrote:

> 
> On Tue, 3 Sep 2002, Ewan Birney wrote:
> 
> > BTW - we should call them  Bio::Seq::EncodedSequence
> 
> Great ... (just out of curiosity, why not Bio::Seq::EncodedSeq ... or
> just Bio::Seq::Encoded; is there a reason for the redundancy?)
> 
> > Remember that the "encoding" sits alongside the bases, i.e., one
> > effectively has two "tracks", being
> >
> >    CCCCCCCCCCCIIIIIIIIIIIIIIIIIIIIIIICCCCCGGGCCCC
> >    ATGGGTGTATGTATTGTGTAAAAAGAATGTTAAGGTTGT---GTET
> 
> > I am happy to get into this. I would propose the following encodings:
> 
> > I could adapt genewise to directly output this stuff.
> 
> I guess I'm not sure why we need an *internal* encoding like this; I would
> argue that the various methods I proposed would be easier via the
> SeqFeature annotation representation (since relative to the length of the
> sequence, the number of gap/intron/frameshift locations should be small).
> Or do you just mean that this encoding should be available for dumping via
> $obj->encoding() (and perhaps acceptable to a new() constructor)?

I think the internals are (rightly) up to the implementor, and I was more
thinking about the interface being things like:

  $seq->encoding_string()

or something.


> 
> $obj = new Bio::Seq::EncodedSequence (-encoding => "CCCCCCCCCCCIIIIIIIIIIIIIIIIIIIIIIICCCCCGGGCCCC",
>                                       -sequence => "ATGGGTGTATGTATTGTGTAAAAAGAATGTTAAGGTTGTGTET",
>                                       -start => 100, -end => 128, -strand => 1
>                                      );
> 
> There was also my "embedded" encoding (which is what we tend to see in
> alignment outputs), with frameshift (/, \), intron boundaries ([...]) and
> gap characters, that I proposed could be obtained via as_string():
> 
> ATGGGT/GTATG[TATTGTGTAAAAAG]AATGT\TAAGGTTGT---GTET
> 

I think this is slightly insane (myself), as your coordinate system now has
to keep track of lots of things. Of course, it has to keep track of gaps
anyway. Hmmm...
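
To make the coordinate question concrete, here is a minimal sketch of turning
the "embedded" string into the two-track form. The helper name
embedded_to_tracks is hypothetical, and so are the F and B labels used for
the base that follows a '/' or '\' marker: the thread only shows C, I and G,
so the frameshift labels are purely an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split an "embedded" string into a sequence track
# and an encoding track of equal length.
# Labels C (coding), I (intron) and G (gap) come from the example above;
# F and B (base after a '/' or '\' frameshift marker) are assumptions.
sub embedded_to_tracks {
    my ($embedded) = @_;
    my ($seq, $enc) = ('', '');
    my $in_intron = 0;
    my $pending   = '';    # frameshift label waiting for the next base
    for my $ch (split //, $embedded) {
        if    ($ch eq '[')  { $in_intron = 1 }
        elsif ($ch eq ']')  { $in_intron = 0 }
        elsif ($ch eq '/')  { $pending = 'F' }
        elsif ($ch eq '\\') { $pending = 'B' }
        else {
            # Every remaining character occupies one column in both tracks.
            $seq .= $ch;
            $enc .= $ch eq '-' ? 'G'
                  : $in_intron ? 'I'
                  : $pending   ? $pending
                  :              'C';
            $pending = '';
        }
    }
    return ($seq, $enc);
}

my ($seq, $enc) = embedded_to_tracks(
    'ATGGGT/GTATG[TATTGTGTAAAAAG]AATGT\TAAGGTTGT---GTET');
print "$enc\n$seq\n";
```

Because the marker characters are dropped while parsing, positions in both
tracks are plain columns of the gapped sequence, which is the same
bookkeeping gaps already require.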

> I guess now I'm inching towards a Bio::SeqIO::encoded::wise,
> Bio::SeqIO::encoded::tfastx, ... ?
> 

Nah. Let's stick to one implementation at the moment, but with a

  Bio::Seq::EncodedSeqI 

we can slot in novel implementations if we like.


I would claim EncodedSeqI should have 


  $seq->encoding_string();

and

  $encoded_label = $seq->encoding_label($position);


methods on it. The constructor should have a well-documented way to
initiate the encoding, which could be either the string, or a set of
features, or both (your choice).
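
As a rough illustration of those two methods, here is a minimal sketch of one
possible implementation. The package name My::EncodedSeq and the choice of
1-based columns on the gapped sequence are assumptions, not actual BioPerl
code, and only the string form of the constructor is shown.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one implementation of the proposed
# Bio::Seq::EncodedSeqI interface; the package name is made up.
package My::EncodedSeq;

sub new {
    my ($class, %args) = @_;
    my $seq      = $args{-sequence};
    my $encoding = $args{-encoding};
    die "need -sequence and -encoding\n"
        unless defined $seq && defined $encoding;
    # The two tracks must line up column for column.
    die "sequence and encoding differ in length\n"
        unless length($seq) == length($encoding);
    return bless { seq => $seq, encoding => $encoding }, $class;
}

# The whole encoding track as one string.
sub encoding_string {
    my ($self) = @_;
    return $self->{encoding};
}

# Label at a 1-based column of the gapped sequence track (an assumption;
# the thread does not fix the coordinate convention).
sub encoding_label {
    my ($self, $position) = @_;
    die "position out of range\n"
        if $position < 1 || $position > length($self->{encoding});
    return substr($self->{encoding}, $position - 1, 1);
}

package main;

my $seq = My::EncodedSeq->new(
    -encoding => "CCCCCCCCCCCIIIIIIIIIIIIIIIIIIIIIIICCCCCGGGCCCC",
    -sequence => "ATGGGTGTATGTATTGTGTAAAAAGAATGTTAAGGTTGT---GTET",
);
print $seq->encoding_label(1),  "\n";   # first coding column
print $seq->encoding_label(12), "\n";   # first intron column
print $seq->encoding_label(40), "\n";   # first gap column
```

With the two example tracks from earlier in the thread, the three calls print
C, I and G. A feature-based constructor could fill the same encoding string
internally, so callers would not need to know which form was used.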



> > Are you keen to code this up Aaron... or hoping I would ?
> 
> I'm good to go, given that I understand the desired direction ... and I
> do agree TIMTOWTDI and all.
> 

Let's do it one way first, then we can do it multiple ways later ;)




> -Aaron
> 
> -- 
>  Aaron J Mackey
>  Pearson Laboratory
>  University of Virginia
>  (434) 924-2821
>  amackey@virginia.edu
> 
> 

-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>. 
-----------------------------------------------------------------