[Bioperl-l] how-to-remove-redundant-lines

Wed Jun 29 07:16:28 EDT 2005

>>>>> "vijayaraj" == vijayaraj nagarajan <bioinfovijayaraj at yahoo.com> writes:
vijayaraj> i have a cluster file with contents like this:

vijayaraj> 1 2 5 7 8 11
vijayaraj> 2 5 7 8 11 
vijayaraj> 3 13 17 19
vijayaraj> 4 21 45 67
vijayaraj> 5 7 8 11

vijayaraj> Now the 1,2 and 5th lines are redundant. i need to
vijayaraj> remove the 2nd and 5th line from the file, while
vijayaraj> retaining only the first line, since the first line
vijayaraj> contains all the members present in 2 and 5th line...

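Here's something much better. It tries to be somewhat efficient: each
line is stored as a hash so membership tests are cheap, and the lines
are sorted by size, largest first, so a line is only ever compared
against lines at least as large as itself. Surviving lines are printed
largest first, with their members sorted numerically.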

Terry

#!/usr/bin/perl -w

use strict;
my @lines;

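# Read each line into a hash-based set and remember how many fields it had.
while (<>){
	my @nums = split;
	my $nums = {};
	@{$nums}{@nums} = ();	# hash slice: one key per member, values unused
	push @lines, [ $nums, scalar(@nums) ];
}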

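# Indices into @lines, ordered from the largest line to the smallest.
my @sorted = sort { $lines[$b]->[1] <=> $lines[$a]->[1] } 0 .. $#lines;

# Print each set unless a set earlier in @sorted already contains it.
for my $i (0 .. $#sorted){
	next if match($i);
	print join(' ', sort { $a <=> $b } keys %{$lines[$sorted[$i]]->[0]}), "\n";
}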

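# Returns 1 if the set at position $index in @sorted is a subset of any
# set that comes before it in @sorted, 0 otherwise.
sub match {
	my $index = shift;
	my $target_set = $lines[$sorted[$index]]->[0];

	for (my $i = 0; $i < $index; $i++){
		my $is_subset = 1;
		my $bigger_set = $lines[$sorted[$i]]->[0];
		for my $element (keys %$target_set){
			unless (exists $bigger_set->{$element}){
				$is_subset = 0;
				last;
			}
		}
		return 1 if $is_subset;
	}
	return 0;	# explicit: a sub that ends in a loop has an unspecified return value
}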

exit(0);
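
If you would rather keep the surviving lines in their original file
order, something along these lines might do. It is only a rough sketch
of the same largest-first idea, not a tested replacement for the script
above. The name remove_subsets and the sample data are made up for
illustration, and it assumes whitespace-separated members as in the
example file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): returns the lines
# that are not contained in some other line, in their original order.
# Of several lines with the same members, only the first is kept.
sub remove_subsets {
	my @rows = map { [ split ' ', $_ ] } @_;
	my @sets = map { my %h; @h{@$_} = (); \%h } @rows;
	my @size = map { scalar keys %$_ } @sets;

	# Visit larger sets first; ties are broken by input position.
	my @order = sort { $size[$b] <=> $size[$a] || $a <=> $b } 0 .. $#sets;

	my @kept;	# indices of surviving lines, largest first
	for my $i (@order){
		my $set = $sets[$i];
		my $covered = 0;
		for my $j (@kept){
			my $bigger = $sets[$j];
			# grep in scalar context counts members missing from $bigger
			unless (grep { !exists $bigger->{$_} } keys %$set){
				$covered = 1;
				last;
			}
		}
		push @kept, $i unless $covered;
	}

	# Restore input order before handing the lines back.
	return map { join ' ', @{ $rows[$_] } } sort { $a <=> $b } @kept;
}

my @sample = ('1 2 5 7 8 11', '2 5 7 8 11', '3 13 17 19', '4 21 45 67', '5 7 8 11');
print "$_\n" for remove_subsets(@sample);
# prints:
# 1 2 5 7 8 11
# 3 13 17 19
# 4 21 45 67
```

Comparing only against lines that have already survived is enough here:
if a line is contained in a discarded line, it is also contained in
whichever line caused that one to be discarded.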