[Bioperl-l] removing duplicate fasta records

Tue, 17 Dec 2002 12:10:58 -0800

Here is a simple way to do it.  The output file some_name will contain 
non-redundant fasta entries (non-redundant as far as the sequences go; IDs 
and descriptions are not compared).  I think there is also an Index module in 
bioperl, but you might consider the code below a very lightweight way of 
doing it. :)

Cheers,

nathanael kuipers
---

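use strict;
use warnings;
use Bio::SeqIO;

# Input FASTA file is the first command-line argument.
my $file = shift or die "Usage: $0 input.fasta\n";
my %already_seen;    # sequence string => number of times seen

my $stream = Bio::SeqIO->new( -file => $file, -format => 'Fasta' );
# ">>" appends: start with an empty some_name, or re-runs will add duplicates.
my $writer = Bio::SeqIO->new( -file => ">>some_name", -format => 'Fasta' );

while ( my $seqobj = $stream->next_seq() ) {
     # Write a record only the first time its sequence is seen.
     next if $already_seen{ $seqobj->seq }++;
     $writer->write_seq( $seqobj );
}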
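
The script above keys the hash on the full sequence string, so memory grows with the total length of the unique sequences. As a rough sketch only, the same "seen hash" idea can be written in plain Perl with a fixed-size MD5 digest as the key. Everything here is an assumption for illustration: dedupe_fasta is a hypothetical helper (not part of BioPerl), it parses FASTA by hand, it upper-cases sequences so that "acgt" and "ACGT" count as the same (the script above does not), and it writes each sequence on one unwrapped line.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper, not a BioPerl function: copy FASTA records from the
# filehandle $in to the filehandle $out, keeping only the first record seen
# for each distinct sequence. Returns the number of records written.
sub dedupe_fasta {
    my ( $in, $out ) = @_;
    my %seen;       # MD5 digest of sequence => number of times seen
    my $kept = 0;

    local $/ = "\n>";    # read one FASTA record per iteration
    while ( my $record = <$in> ) {
        $record =~ s/^>//;        # leading '>' of the very first record
        $record =~ s/\n>\z//;     # record separator left at the end
        my ( $header, @lines ) = split /\n/, $record;
        next unless defined $header;    # skip empty chunks

        # Assumption: comparison is case-insensitive and ignores whitespace.
        my $seq = uc join '', @lines;
        $seq =~ s/\s+//g;

        next if $seen{ md5_hex($seq) }++;    # already written once
        print {$out} ">$header\n$seq\n";
        $kept++;
    }
    return $kept;
}
```

It takes two opened filehandles, for example dedupe_fasta( \*STDIN, \*STDOUT ). Keying on a digest trades a negligible chance of a collision for a small, constant-size key per unique sequence.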

>===== Original Message From "Amit Indap <indapa@cs.arizona.edu>" <indapa@amadeus.biosci.arizona.edu> =====
>I have a file with a list of fasta sequences. Is there a way to
>remove records with the identical sequence? I am a newbie to BioPerl,
>and my search through the documentation hasn't found anything.
>
>Thank you.
>
>Amit Indap