[Bioperl-l] Fishing redundant sequences in FASTA files [Right formatting]

Tue Feb 15 15:02:38 EST 2011

Juan,

If you are checking for simple complete matches, I would suggest using a hash.  However, you are also looking for partial matches.  In that case it seems like you should be (ab)using something akin to mcl to cluster like sequences together; you're essentially performing an all-v-all comparison anyway, so you might as well take advantage of faster tools.

So, basically:

1) Run an all-v-all comparison, filtering on 100% identity, no gaps
2) Cluster using mcl
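
As a rough sketch of how step 1 could feed step 2: the filter below keeps only hits with 100% identity, no mismatches and no gap openings, and prints them as label/label/weight lines. It assumes the all-v-all search was run with BLAST and saved in the standard 12-column tabular format, and that mcl will read the result as label input (--abc). The sub name abc_line and the sample hits are made up for illustration, and using the bit score as the edge weight is just one possible choice.

```perl
use strict;
use warnings;

# Turn one line of 12-column tabular BLAST output into a tab-separated
# "label label weight" line, or return nothing if the hit is filtered out.
# (abc_line is a hypothetical helper name, not part of BioPerl or mcl.)
sub abc_line {
    my ($line) = @_;
    chomp $line;
    return if $line =~ /^#/ || $line !~ /\S/;    # skip comments and blank lines
    my ($query, $subject, $pident, $length, $mismatch, $gapopen,
        $qstart, $qend, $sstart, $send, $evalue, $bitscore) = split /\t/, $line;
    return unless defined $bitscore;             # not a 12-column line
    return if $query eq $subject;                # ignore self hits
    return unless $pident == 100 && $mismatch == 0 && $gapopen == 0;
    return join("\t", $query, $subject, $bitscore) . "\n";
}

# Made-up sample hits, only to show the filter at work.
my @sample_hits = (
    "seqA\tseqA\t100.00\t80\t0\t0\t1\t80\t1\t80\t1e-40\t148\n",
    "seqA\tseqB\t100.00\t60\t0\t0\t1\t60\t21\t80\t1e-28\t111\n",
    "seqA\tseqC\t97.50\t80\t2\t0\t1\t80\t1\t80\t1e-35\t137\n",
    "seqB\tseqD\t100.00\t40\t0\t1\t1\t40\t1\t41\t1e-15\t70.1\n",
);

for my $hit (@sample_hits) {
    my $abc = abc_line($hit);
    print $abc if defined $abc;
}
```

With the sample hits only the seqA/seqB line survives. Something like "mcl hits.abc --abc -o clusters.txt" should then cluster the filtered file, but check the mcl documentation for the exact options. Note that this filter says nothing about how much of each sequence the hit covers; to call one sequence fully contained in another you would also need to compare the alignment length with the sequence lengths.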

Note the BLAST-related programs here for that purpose:

http://www.micans.org/mcl/man/mclfamily.html

I think you can also use other tools instead of BLAST; I just can't recall at the moment which mcl pipeline to use.
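
For the simple complete-match case, the hash idea might look roughly like the sketch below. It reads every record once and groups IDs by sequence, so nothing has to be rewound and no nested loop is needed. To keep it self-contained it uses a small hand-rolled FASTA reader on in-memory strings instead of SeqIO; read_fasta, exact_duplicates and the sample records are hypothetical, and upper-casing the sequences (so that case differences are ignored) is an assumption you may not want.

```perl
use strict;
use warnings;

# Read FASTA records from an open filehandle.
# Returns a list of [id, sequence] pairs (hypothetical helper, not SeqIO).
sub read_fasta {
    my ($fh) = @_;
    my (@records, $id, $seq);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;
        if ($line =~ /^>(\S*)/) {
            push @records, [$id, $seq] if defined $id;
            ($id, $seq) = ($1, '');
        }
        elsif (defined $id) {
            $line =~ s/\s+//g;
            $seq .= uc $line;    # assumption: compare case-insensitively
        }
    }
    push @records, [$id, $seq] if defined $id;
    return @records;
}

# Group record IDs by identical sequence and
# return only the groups that contain more than one ID.
sub exact_duplicates {
    my (@records) = @_;
    my %ids_for;
    push @{ $ids_for{ $_->[1] } }, $_->[0] for @records;
    return grep { @$_ > 1 } values %ids_for;
}

# Two made-up in-memory "files", standing in for file_1 and file_2.
my $file_1 = ">a1\nACGTACGT\n>a2\nTTTTGGGG\n";
my $file_2 = ">b1\nacgt\nACGT\n>b2\nCCCCCCCC\n";

my @records;
for my $text ($file_1, $file_2) {
    open my $fh, '<', \$text or die "Cannot open in-memory FASTA: $!";
    push @records, read_fasta($fh);
    close $fh;
}

print "Records @$_ are redundant\n" for exact_duplicates(@records);
```

With SeqIO the same hash could be filled inside a single next_seq loop per file, using $dna->seq as the key and $dna->id as the value. This only catches complete matches; partial matches still need the all-v-all route.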

chris

On Feb 15, 2011, at 1:34 PM, Juan Jovel wrote:

> Good Morning guys,
> Sorry for the naive question: what's the simplest way to fish out redundant sequences (complete or partial) between two (or more) fasta files?
> I was thinking of just doing it with SeqIO, opening two files and comparing each sequence of file_1 to each record of file_2, like:
> # Read each record of file 1 and compare it to each read of file 2
> while (my $dna1 = $seqin1->next_seq) {
>     my $seq1 = $dna1->seq;
>     my $id1  = $dna1->id;
>
>     # Iterate inside the second fasta file
>     while (my $dna2 = $seqin2->next_seq) {
>         my $seq2 = $dna2->seq;
>         my $id2  = $dna2->id;
>
>         if (($seq1 =~ /$seq2/) || ($seq2 =~ /$seq1/)) {
>             print "Match found \n";
>             print OUT "Records $id1 and $id2 are redundant";
>         }
>     }
> }
> 
> I am afraid it is going to be slow for large files.  AND, more importantly, how do I reset the object containing the second file to the first line, as done in Perl with seek(IN, 0, 0), for example?  Does SeqIO allow that?  (Sorry, I am not a frequent user of SeqIO.)  If there is another, more elaborate module to fish out such redundant sequences, I would appreciate knowing about it.
> 
> Thanks,
> JUAN