Bioperl: relative-majority consensus, fast code sought

Georg Fuellen fuellen@alum.mit.edu
Mon, 1 Mar 1999 21:36:30 +0100 (MET)


Many thanks to Andrew Dalke for mailing me a suggestion to
accelerate the sort-- he has also pointed me to two typos 
which are now corrected in the example appended.
Any further hints would be greatly appreciated :)

Andrew Dalke wrote,
> Finally, why do the sorts (n*log(n)) operations when you really
> want a search (order(n))?
> 
> @chars = split(//, "AGCATCGATGATAGTGCTATGAATCGTATGATATGCCAA");
> $threshold = 0.15;
> $threshold *= ($#chars+1);
> 
> %temp = ();               # count the occurance of each character
> grep ++$temp{$_}, @chars;
> 
> $best_key = undef;        # flag to tell if something met the threshold
> $best_val = $threshold;   # known minimum value
> while ( ($k, $v) = each %temp) {   # linear search to find the max
>   if ($v>$best_val ||
>      ($v==$best_val && (! $best_key || $best_key lt $k))) {
>     $best_key = $k;
>     $best_val = $v;
>   }
> }
> if ($best_key) {
>   print "$best_key ($best_val)\n";
> } else {
>   print "!\n";
> }

Thanks-- most of the @char vectors are just 4-10 characters long,
but since the subroutine is called many many times, it's probably
a big improvement (it seems to be about 50% on my small test data set)

> 						Andrew
> 						dalke@bioreason.com
> 

Corrected code (hopefully;-):

>>>>>>>
Goal: To compute the consensus of the letters in @chars,
w/ the option to request the relative-majority winner,
and break ties according to alphabetical order.

$threshold = 0; # $threshold==0 (or 0.249) implies relative majority
                # $threshold==0.33 implies relative majority > one-third
                # $threshold==0.5 implies absolute majority,
                # $threshold==0.66 a two-thirds majority
$threshold *= ($#chars+1);  #eg if there are 50 chars, $threshold==0.5, 
                             #25 is the lower bound for absolute majority
%temp = ();
@list = sort { $temp{$b}<=>$temp{$a} } grep ++$temp{$_} > $threshold, @chars;
  #@list is ordered by number of occurances, only chars observed enough times
@list2 = sort {$a cmp $b} grep { $temp{$_} == $temp{$list[0]} } @list;
  #@list2 is ordered lexicographically, only chars observed most often
$consensus = (defined($list2[0]) ? $list2[0] : "!"); 
  #"!" -> no consensus
<<<<<<<

best wishes,
georg
=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://bio.perl.org/
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
====================================================================