[Biopython] removing redundant sequence
Peter
biopython at maubp.freeserve.co.uk
Wed Apr 21 15:10:45 UTC 2010
On Wed, Apr 21, 2010 at 3:25 PM, Bala subramanian
<bala.biophysics at gmail.com> wrote:
> Peter,
> Sorry for the delayed reply. Yes i want to remove those sequences that are
> 100% identical but they have different identifier. I created a sample fasta
> file with two redundant sequences. But when i use checksums seguid to spot
> the redundancies, it spots only the first one.
>
> In [36]: for record in SeqIO.parse(open('t'),'fasta'):
> ....: print record.id, seguid(record.seq)
> ....:
> ....:
> A04321 44lpJ2F4Eb74aKigVa5Sut/J0M8
> *AF02161a asaPdDgrYXwwJItOY/wlQFGTmGw
> AF02161b asaPdDgrYXwwJItOY/wlQFGTmGw*
> AF021618 JvRNzgmeXDBbA9SL5+OQaH2V/zA
> AF021622 JvRNzgmeXDBbA9SL5+OQaH2V/zA
> AF021627 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ
> AF021628 2GT4z2fXZdv9f51ng74C8o0rQXM
> AF021629 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ
> *AF02163a fOKCIiGvk6NaPDYY6oKx74tvcxY
> AF02163b fOKCIiGvk6NaPDYY6oKx74tvcxY
> *
> In [37]: hivdict=SeqIO.to_dict(SeqIO.parse(open('t'),'fasta'),lambda
> rec:seguid(rec.seq))
> ---------------------------------------------------------------------------
> ValueError Traceback (most recent call last)
>
> /home/cbala/test/<ipython console> in <module>()
>
> /usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.pyc in
> to_dict(sequences, key_function)
> 585 key = key_function(record)
> 586 if key in d :
> --> 587 raise ValueError("Duplicate key '%s'" % key)
> 588 d[key] = record
> 589 return d
>
> ValueError: Duplicate key 'asaPdDgrYXwwJItOY/wlQFGTmGw'
Hi Bala,
You know there are duplicate sequences in your file, so if you try
to use the SEGUID as a key, there will be duplicate keys. Thus
you get this error message. If you want to use Bio.SeqIO.to_dict
you have to have unique keys.
What you should do is loop over the records and keep a record
of the checksums you have saved, and use that to ignore duplicates.
I would use a python set rather than a python list for speed.
You could do this with a for loop. However, I would probably use an
iterator based approach with a generator function - I think it is more
elegant but perhaps not so easy for a beginner:
from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid
def remove_dup_seqs(records):
""""SeqRecord iterator to removing duplicate sequences."""
checksums = set()
for record in records:
checksum = seguid(record.seq)
if checksum in checksums:
print "Ignoring %s" % record.id
continue
checksums.add(checksum)
yield record
records = remove_dup_seqs(SeqIO.parse("with_dups.fasta", "fasta"))
count = SeqIO.write(records, "no_dups.fasta", "fasta")
print "Saved %i records" % count
Note I've used filename with Bio.SeqIO which requires Biopython 1.54b
or later - for older versions use handles. See also:
http://news.open-bio.org/news/2010/04/biopython-seqio-and-alignio-easier/
Peter
More information about the Biopython
mailing list