[Biopython] removing redundant sequence

Wed Apr 21 15:10:45 UTC 2010

On Wed, Apr 21, 2010 at 3:25 PM, Bala subramanian
<bala.biophysics at gmail.com> wrote:
> Peter,
> Sorry for the delayed reply. Yes i want to remove those sequences that are
> 100% identical but they have different identifier. I created a sample fasta
> file with two redundant sequences. But when i use checksums seguid to spot
> the redundancies, it spots only the first one.
>
> In [36]: for record in SeqIO.parse(open('t'),'fasta'):
>   ....:     print record.id, seguid(record.seq)
>   ....:
>   ....:
> A04321 44lpJ2F4Eb74aKigVa5Sut/J0M8
> *AF02161a asaPdDgrYXwwJItOY/wlQFGTmGw
> AF02161b asaPdDgrYXwwJItOY/wlQFGTmGw*
> AF021618 JvRNzgmeXDBbA9SL5+OQaH2V/zA
> AF021622 JvRNzgmeXDBbA9SL5+OQaH2V/zA
> AF021627 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ
> AF021628 2GT4z2fXZdv9f51ng74C8o0rQXM
> AF021629 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ
> *AF02163a fOKCIiGvk6NaPDYY6oKx74tvcxY
> AF02163b fOKCIiGvk6NaPDYY6oKx74tvcxY
> *
> In [37]: hivdict=SeqIO.to_dict(SeqIO.parse(open('t'),'fasta'),lambda
> rec:seguid(rec.seq))
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
>
> /home/cbala/test/<ipython console> in <module>()
>
> /usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.pyc in
> to_dict(sequences, key_function)
>    585         key = key_function(record)
>    586         if key in d :
> --> 587             raise ValueError("Duplicate key '%s'" % key)
>    588         d[key] = record
>    589     return d
>
> ValueError: Duplicate key 'asaPdDgrYXwwJItOY/wlQFGTmGw'

Hi Bala,

You know there are duplicate sequences in your file, so if you try
to use the SEGUID as a key, there will be duplicate keys. Thus
you get this error message. If you want to use Bio.SeqIO.to_dict
you have to have unique keys.
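
If the aim is just to see which identifiers share a sequence, one
option is to collect the IDs under each checksum in a dictionary of
lists instead of calling to_dict. A minimal sketch, not tested on
your file: group_ids_by_checksum and report_duplicates are names made
up for this example, and the checksum function is passed in as an
argument (seguid in your case):

```python
from collections import defaultdict


def group_ids_by_checksum(records, checksum):
    """Map each checksum to the list of record IDs that share it."""
    groups = defaultdict(list)
    for record in records:
        groups[checksum(record.seq)].append(record.id)
    return groups


def report_duplicates(filename):
    """Print each set of IDs in a FASTA file with identical sequences."""
    # Imported here so the grouping function above does not need Biopython.
    from Bio import SeqIO
    from Bio.SeqUtils.CheckSum import seguid

    handle = open(filename)
    try:
        groups = group_ids_by_checksum(SeqIO.parse(handle, "fasta"), seguid)
    finally:
        handle.close()
    for key, ids in sorted(groups.items()):
        if len(ids) > 1:
            print("%s shared by %s" % (key, ", ".join(ids)))
```

Calling report_duplicates('t') should then list every group of
identical sequences rather than stopping at the first duplicate key.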

What you should do is loop over the records, keep track of the
checksums you have already seen, and use that to ignore duplicates.
I would use a Python set rather than a Python list for speed.

You could do this with a for loop. However, I would probably use an
iterator-based approach with a generator function. I think it is more
elegant, but perhaps not so easy for a beginner:

from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid

def remove_dup_seqs(records):
    """SeqRecord iterator that removes duplicate sequences."""
    checksums = set()
    for record in records:
        checksum = seguid(record.seq)
        if checksum in checksums:
            # Same sequence already seen under another identifier
            print("Ignoring %s" % record.id)
            continue
        checksums.add(checksum)
        yield record

records = remove_dup_seqs(SeqIO.parse("with_dups.fasta", "fasta"))
count = SeqIO.write(records, "no_dups.fasta", "fasta")
print("Saved %i records" % count)

Note I've used filenames with Bio.SeqIO, which requires Biopython 1.54b
or later - for older versions use handles. See also:

http://news.open-bio.org/news/2010/04/biopython-seqio-and-alignio-easier/
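
For comparison, the plain for loop alternative mentioned above might
look something like this, using handles so it does not depend on
Biopython 1.54b. This is only a sketch: unique_records and dedup_fasta
are names invented here, and the checksum function is passed in as an
argument so the loop itself does not care which checksum is used.

```python
def unique_records(records, checksum):
    """Return a list keeping the first record for each distinct sequence."""
    seen = set()
    unique = []
    for record in records:
        key = checksum(record.seq)
        if key in seen:
            continue  # duplicate sequence, skip it
        seen.add(key)
        unique.append(record)
    return unique


def dedup_fasta(in_name, out_name):
    """Copy a FASTA file, dropping repeated sequences; return the count saved."""
    # Imported here so unique_records above does not need Biopython.
    from Bio import SeqIO
    from Bio.SeqUtils.CheckSum import seguid

    in_handle = open(in_name)
    try:
        records = unique_records(SeqIO.parse(in_handle, "fasta"), seguid)
    finally:
        in_handle.close()
    out_handle = open(out_name, "w")
    try:
        count = SeqIO.write(records, out_handle, "fasta")
    finally:
        out_handle.close()
    return count
```

Unlike the generator, this builds the full list of unique records in
memory before writing, which is fine for small files but worth bearing
in mind for large ones.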

Peter