[Biopython] removing redundant sequence

Wed Apr 21 14:25:35 UTC 2010

Peter,
Sorry for the delayed reply. Yes, I want to remove those sequences that are
100% identical but have different identifiers. I created a sample FASTA
file with two redundant sequences. But when I use the SEGUID checksum to spot
the redundancies, it reports only the first one (see the ValueError below).

In [36]: for record in SeqIO.parse(open('t'),'fasta'):
   ....:     print record.id, seguid(record.seq)
   ....:
   ....:
A04321 44lpJ2F4Eb74aKigVa5Sut/J0M8
AF02161a asaPdDgrYXwwJItOY/wlQFGTmGw
AF02161b asaPdDgrYXwwJItOY/wlQFGTmGw
AF021618 JvRNzgmeXDBbA9SL5+OQaH2V/zA
AF021622 JvRNzgmeXDBbA9SL5+OQaH2V/zA
AF021627 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ
AF021628 2GT4z2fXZdv9f51ng74C8o0rQXM
AF021629 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ
AF02163a fOKCIiGvk6NaPDYY6oKx74tvcxY
AF02163b fOKCIiGvk6NaPDYY6oKx74tvcxY

In [37]: hivdict=SeqIO.to_dict(SeqIO.parse(open('t'),'fasta'),lambda rec:seguid(rec.seq))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

/home/cbala/test/<ipython console> in <module>()

/usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.pyc in to_dict(sequences, key_function)
    585         key = key_function(record)
    586         if key in d :
--> 587             raise ValueError("Duplicate key '%s'" % key)
    588         d[key] = record
    589     return d

ValueError: Duplicate key 'asaPdDgrYXwwJItOY/wlQFGTmGw'
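
The traceback shows that SeqIO.to_dict raises on the first repeated key, so it cannot be used directly to drop duplicates. A minimal sketch of one alternative is to remember the checksums already seen in a set and keep only the first record for each. To stay self-contained, the example below does not import Biopython: `seguid_like` is a stand-in written for illustration (base64 of the SHA-1 of the upper-cased sequence, padding stripped), and `unique_records` and the sample records are hypothetical names and data, not part of Biopython.

```python
import base64
import hashlib


def seguid_like(sequence):
    """Illustrative stand-in for a SEGUID checksum (assumption: base64 of the
    SHA-1 digest of the upper-cased sequence, with '=' padding removed)."""
    digest = hashlib.sha1(str(sequence).upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def unique_records(records, key):
    """Yield the first record seen for each key; later duplicates are skipped."""
    seen = set()
    for record in records:
        k = key(record)
        if k in seen:
            continue  # same checksum already kept under another identifier
        seen.add(k)
        yield record


# Hypothetical (identifier, sequence) pairs standing in for parsed FASTA records.
records = [
    ("seq1", "ACGTACGT"),
    ("seq2", "ACGTACGT"),  # same sequence as seq1, different identifier
    ("seq3", "TTGACCAA"),
]

kept = list(unique_records(records, key=lambda rec: seguid_like(rec[1])))
print([identifier for identifier, _ in kept])  # ['seq1', 'seq3']
```

With Biopython, the same generator should work on real records, for example by passing SeqIO.parse(open('t'), 'fasta') as `records` and `lambda rec: seguid(rec.seq)` as `key`, then handing the result to SeqIO.write. Which of the identical records is kept (here, the first) is a design choice.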

On Tue, Apr 13, 2010 at 5:02 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Tue, Apr 13, 2010 at 3:49 PM, Bala subramanian
> <bala.biophysics at gmail.com> wrote:
> > Friends,
> > Sorry if this question was asked before. Is there any function in Biopython
> > that can remove redundant sequence records from a FASTA file?
> >
> > Thanks,
> > Bala
>
> No, but you should be able to do this with Biopython - depending on
> what exactly you are asking for.
>
> When you say "redundant" do you mean 100% perfect identity?
>
> How big is your FASTA file - are you working with next-gen sequencing
> data and millions of reads? If it is small enough you can keep all
> the data in memory to compare sequences to each other. Otherwise
> you might try using a checksum (e.g. SEGUID) to spot duplicates.
>
> Peter
>
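
For a small file, the in-memory comparison mentioned in the quoted reply could be sketched as follows: use the sequence string itself as a dictionary key and collect every identifier that carries it. This is only an illustration under stated assumptions; `group_identical` and the sample records are hypothetical, and the records are plain (identifier, sequence) pairs rather than Biopython objects.

```python
from collections import defaultdict


def group_identical(records):
    """Map each distinct sequence (compared case-insensitively) to the list
    of identifiers that carry it, in input order."""
    groups = defaultdict(list)
    for identifier, sequence in records:
        groups[str(sequence).upper()].append(identifier)
    return groups


# Hypothetical (identifier, sequence) pairs standing in for parsed FASTA records.
records = [
    ("seq1", "ACGTACGT"),
    ("seq2", "acgtacgt"),  # identical to seq1 apart from letter case
    ("seq3", "TTGACCAA"),
]

groups = group_identical(records)
# Keep only the groups in which more than one identifier shares a sequence.
duplicates = [ids for ids in groups.values() if len(ids) > 1]
print(duplicates)  # [['seq1', 'seq2']]
```

Keying on the full sequence keeps every sequence in memory, which is why a short checksum is the suggested fallback for very large files.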