[Biopython] removing redundant sequence

Peter biopython at maubp.freeserve.co.uk
Tue Apr 13 15:02:52 UTC 2010


On Tue, Apr 13, 2010 at 3:49 PM, Bala subramanian
<bala.biophysics at gmail.com> wrote:
> Friends,
> Sorry if this question was asked before. Is there any function in Biopython
> that can remove redundant sequence records from a fasta file.
>
> Thanks,
> Bala

No, but you should be able to do this with Biopython - depending on
what exactly you are asking for.

When you say "redundant" do you mean 100% perfect identity?

How big is your FASTA file? Are you working with next-gen sequencing
data and millions of reads? If it is small enough, you can keep all
the data in memory and compare the sequences to each other. Otherwise
you might try using a checksum (e.g. SEGUID) to spot duplicates.
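
As a rough sketch of the checksum idea, assuming "redundant" means
exact (case-insensitive) sequence identity, something like the
following might do. The names seguid_like, unique_records and
dedupe_fasta are made up for this example, as are the file names.
The checksum here is a SEGUID-style SHA-1 digest built with hashlib
so the filter runs without Biopython; Bio.SeqUtils.CheckSum.seguid
could be passed as the key instead.

```python
import base64
import hashlib


def seguid_like(sequence):
    """Return a SEGUID-style checksum: base64 SHA-1 of the upper-cased sequence."""
    digest = hashlib.sha1(str(sequence).upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def unique_records(records, key=seguid_like):
    """Yield only the first record seen for each distinct sequence.

    Only the checksums are kept in memory, not the sequences themselves.
    """
    seen = set()
    for record in records:
        checksum = key(record.seq)
        if checksum in seen:
            continue  # same sequence already written, skip this record
        seen.add(checksum)
        yield record


def dedupe_fasta(in_path, out_path):
    """Hypothetical helper: filter a FASTA file using Biopython's SeqIO."""
    from Bio import SeqIO  # imported here so the filter above has no dependency

    records = SeqIO.parse(in_path, "fasta")
    return SeqIO.write(unique_records(records), out_path, "fasta")
```

Calling dedupe_fasta("input.fasta", "unique.fasta") would then stream
through the file once and return the number of records written. This
only catches exact duplicates; sequences that are merely similar, or
contained within a longer one, would need a different approach.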

Peter


