[Bioperl-l] removing duplicate fasta records

Jonathan Epstein Jonathan_Epstein@nih.gov
Tue, 17 Dec 2002 15:53:57 -0500


Option #1:
  create a hash keyed by the sequences as you read them, and detect duplicates by checking whether that key already exists in the hash
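A minimal Perl sketch of Option #1, using made-up in-memory records for illustration (in practice you would read them from your FASTA file, e.g. with Bio::SeqIO):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical records: [ id, sequence ] pairs standing in for
# what you would get back from a FASTA parser.
my @records = (
    [ 'seq1', 'ACGTACGT' ],
    [ 'seq2', 'GGGGCCCC' ],
    [ 'seq3', 'ACGTACGT' ],   # same sequence as seq1 -> duplicate
);

my %seen;      # sequence string => true once emitted
my @unique;
for my $rec (@records) {
    my ( $id, $seq ) = @$rec;
    next if $seen{$seq}++;    # skip records whose sequence we've seen
    push @unique, $rec;
}

print "$_->[0]\n" for @unique;   # prints seq1, seq2
```

Note that the full sequence strings are held as hash keys, which is why this approach costs memory proportional to the total sequence data.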

Option #2 (slightly harder):
  create such a hash and check for duplicates, but use a checksum of each sequence as the hash key instead of the sequence itself, e.g. MD5:
    http://search.cpan.org/author/GAAS/Digest-MD5-2.20/MD5.pm
or the GCG checksum:
    http://search.cpan.org/author/BIRNEY/bioperl-1.0.2/Bio/SeqIO/gcg.pm

This requires more CPU time, but much less memory, since each key is a short fixed-size digest rather than the full sequence.
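The same sketch adapted to Option #2, keying the hash on an MD5 digest (Digest::MD5 ships with Perl; the records here are again hypothetical stand-ins for parsed FASTA entries):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical [ id, sequence ] records; the long sequences show
# where the memory saving comes from.
my @records = (
    [ 'seq1', 'ACGT' x 10_000 ],
    [ 'seq2', 'ACGT' x 10_000 ],   # duplicate of seq1
    [ 'seq3', 'TTTTGGGG' ],
);

my %seen;      # 32-character hex digest => true once emitted
my @unique;
for my $rec (@records) {
    my $digest = md5_hex( $rec->[1] );   # fixed-size key, not the sequence
    next if $seen{$digest}++;
    push @unique, $rec;
}

print "$_->[0]\n" for @unique;   # prints seq1, seq3
```

Each key is 32 characters regardless of sequence length, so memory stays flat even for very large sequences, at the cost of computing a digest per record.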


Option #1 is quick-and-dirty, and is appropriate if your input file contains only a few megabytes of data (or less).

Jonathan


At 12:41 PM 12/17/2002 -0700, Amit Indap <indapa@cs.arizona.edu> wrote:
>I have a file with a list of fasta sequences. Is there a way to
>remove records with identical sequences? I am a newbie to BioPerl,
>and my search through the documentation hasn't found anything.
>
>Thank you.
>
>Amit Indap