[Bioperl-l] how-to-remove-redundant-lines

Wed Jun 29 05:32:09 EDT 2005

>>>>> "vijayaraj" == vijayaraj nagarajan <bioinfovijayaraj at yahoo.com> writes:
vijayaraj> i have a cluster file with contents like this:

vijayaraj> 1 2 5 7 8 11
vijayaraj> 2 5 7 8 11 
vijayaraj> 3 13 17 19
vijayaraj> 4 21 45 67
vijayaraj> 5 7 8 11

vijayaraj> Now the 1,2 and 5th lines are redundant. i need to
vijayaraj> remove the 2nd and 5th line from the file, while
vijayaraj> retaining only the first line, since the first line
vijayaraj> contains all the members present in 2 and 5th line...

Are there any constraints on your data that might help solve this?

For example, do the numbers always have exactly one space between
them? Do the numbers always appear in ascending order? Is there ever
any trailing whitespace on a line? (There was a space at the end of
your second line.)

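If the answers to the above are yes, yes, and no, then the following
works for data like your example, where each redundant line appears
verbatim at the end of a line that comes before it in the file. If
not, you'll need to do a little more to canonicalize each line (e.g.,
strip spaces, sort the numbers, etc.).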

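#!/usr/bin/perl -w

use strict;
my @lines;    # lines printed so far

while (my $line = <>){
    # Skip a line that occurs inside a line we have already kept.
    # The \b stops "1 7" from matching inside "21 7"; the newline
    # still on $line ties the match to the end of the kept line.
    next if grep /\b\Q$line\E/, @lines;
    push @lines, $line;
    print $line;
}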

Or, if you feel like being obscure today, you can do it straight from
the command line:

  perl -ne '$l = $_; push(@lines, $l), print($l) unless grep /\b\Q$l\E/, @lines' < data
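
If your data does not meet those constraints, something along these
lines might be a starting point. It is only a sketch, and it swaps
the substring test for a real subset test: each line is split on any
whitespace, sorted numerically, and dropped if all of its members
occur in some other line, wherever that line sits in the file. The
sub names canonical and remove_subsets are made up for illustration,
and I am assuming every line holds only numbers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a line into the sorted list of its unique
# numbers. split ' ' ignores leading/trailing whitespace and runs of spaces.
sub canonical {
    my ($line) = @_;
    my %seen;
    return sort { $a <=> $b } grep { !$seen{$_}++ } split ' ', $line;
}

# Hypothetical helper: return the lines (canonicalized, in input order)
# whose members are not all contained in another line. Of several
# identical lines, the first one is kept.
sub remove_subsets {
    my @sets = map { [ canonical($_) ] } @_;

    # Visit larger sets first (ties in input order), so that a superset
    # is always examined before any of its subsets.
    my @order = sort { @{ $sets[$b] } <=> @{ $sets[$a] } or $a <=> $b }
                0 .. $#sets;

    my (@kept, %keep);
    for my $i (@order) {
        my @set = @{ $sets[$i] };
        next unless @set;                 # ignore blank lines
        my $redundant = 0;
        for my $k (@kept) {               # $k: hash of a kept set's members
            if (@set == grep { $k->{$_} } @set) { $redundant = 1; last }
        }
        next if $redundant;
        push @kept, { map { $_ => 1 } @set };
        $keep{$i} = 1;
    }
    return map { join ' ', @{ $sets[$_] } } grep { $keep{$_} } 0 .. $#sets;
}

# Sample data from the question, including the trailing space on line 2.
my @data = ("1 2 5 7 8 11", "2 5 7 8 11 ", "3 13 17 19",
            "4 21 45 67", "5 7 8 11");
print "$_\n" for remove_subsets(@data);
```

For the sample data this prints the first, third and fourth lines. To
run it on a file, replace the sample list with "my @data = <>;". It
holds every line in memory and compares each line against all kept
lines, so it suits small cluster files better than huge ones.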

Terry