Bioperl: A greedy consensus builder

Ian Holmes ihh@lanl.gov
Wed, 18 Aug 1999 07:38:58 -0600 (MDT)


On Wed, 4 Aug 1999, Mike Cariaso wrote:

> To scratch an itch I've built a small object which ISA Bio::UnivAln.
> It has a single method 'greedy' which constucts a consensus sequence using a
> somewhat different method. The approach does ignores alignment gaps that are
> before (or after) the first (last) non-gap of a row. An example may explain it
> more clearly.
> 
> 
> Sample alignment:
> 
> row1  :ATCTTCGCTCGCTCGCTTATATA-ATAAGATAAGATATCGCTCCGCTCGCCTCGCTCCTCAAAGCTCGCTC
> row2  :--CTTCGCTCGCTCGCTTATATA-ATAAGATAAGATATCGCTCCGCTCGCCTCGCTCCTC---CCTCGCTC
> row3  :--------------------ATAAATAAGATAAGATATCGCTCTGCTCGCCTCGCTCCTC---CCTCGCTC
> row4  :---------------------------AGATAAGATATCGCTCCGCTCGCCTCGCTCCTC---CCTCGCTC
> row5  :---------------------------AGATAAGATATCGCTCCGCTCGCCTCGCTCCTC---CCTCGCTC
> row6  :-------------------------------------------CGCTCGCCTCGCTCCTC---CCTCGCTC
> 
> cons  :---------------------------AGATAAGATATCGCTCCGCTCGCCTCGCTCCTC---CCTCGCTC
> greedy:ATCTTCGCTCGCTCGCTTATATA-ATAAGATAAGATATCGCTCCGCTCGCCTCGCTCCTC---CCTCGCTC
> 
> 
> So for the above alignment the current technique will call the first 20 or so
> bases as gaps since that is the most common char. The greedy approach assumes
> that this area is outside the known region of those rows, and ignores the gaps
> there. This seems useful when working with small partial fragments.
> 
> 
> 
> If there is interest I'll be happy (honored, actually) to contribute it to
> bio.perl. 
> The interface has the threshold param as well as another optional one to specify
> the minimum number of rows necessary to do a base call. And if you noticed any
> errors in the alignment, its totally bogus data, so the mistake is mine.

This is not really a comment on your algorithm since I'm sure there are
many heuristics out there appropriate for different situations - which is
why I'd like to propose a move towards a common format.

I'm afraid I don't really know how (if at all) consensus sequence is
handled in bioperl at the minute, but from a theorists viewpoint I'd argue
for HMMER-style HMMs as the ideal representation for this, the main reason
being that HMMs handle gaps more consistently than treating them as a 21st
amino acid.

Ungapped profiles can be mapped onto HMMER profiles with nearly no effort
but there is an argument for having a separate data structure for these,
since they can be fed directly into many sequence analysis algorithms
without change, and are convenient for representing e.g. ambiguous
residues.

=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://bio.perl.org/
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
====================================================================