[Biopython] remove list redundancy

Fri Mar 23 22:23:13 UTC 2012

Hi Ferreira,

As Iddo have mentioned, it's better to build a new list containing
unique records instead. Here's my shot at a method like that:

from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid

# Returns a list containing unique SeqRecord list.
def remove_redundant(fastafile):
    records = SeqIO.parse(fastafile, 'fasta')

    # new list container
    unique_records = []
    # unique sequence checksum container
    checksum_container = []

    for record in records:
        checksum = seguid(record.seq)
        if checksum not in checksum_container:
            unique_records.append(record)

    return unique_records

I assume your Fasta file is not very big since you opted to load
everything into memory in your initial script. If it were big, you
could change the method into a generator to save memory, by writing
this instead:

    # previous lines...
    for record in records:
        checksum = seguid(record.seq)
        if checksum not in checksum_container:
            yield record

And iterating over the function like so:

    for unique_record in remove_redundant(fastafile):
        # process the records here

Hope that helps,
---
Bow

On Fri, Mar 23, 2012 at 23:19, Iddo Friedberg <idoerg at gmail.com> wrote:
> Python assigns by reference, not by value. So you can have the following:
>
>>>> a=[1,2,3]
>>>> b=a
>>>> print b
> [1, 2, 3]
>>>> del b[1]
>>>> print a
> [1, 3]
>>>>
>
> So if you remove an item from list b, it will remove it from a as well.
> Which is why in your case, record and new_rec end up the same, since they
> were the same to start off with.
>
> Furthermore, in your loop, you are changing the length of "record" which is
> the target of a for loop. Never a good idea and yields unexpected results.
> Finally, the index "j" you are using points to one thing in record, but
> will point to another thing in new_rec.
>
> You can do an assignment by value using the copy module
> new_rec=copy.copy(record)
>
> That will create a completely new copy of record in new_rec.
>
> That still won't solve the problem that you have in the shifting place "j"
> points to in the loop though.
>
> I would suggest building a list of non-redundant sequences rather than
> deleting from a list of redundant sequences.
>
>
> HTH,
>
> Iddo
>
> On Fri, Mar 23, 2012 at 5:55 PM, <ferreirafm at usp.br> wrote:
>
>> Hi Biopy users,
>> I have a mult-sequence fasta file which I've read as a list. Is there a
>> clever way/method to remove redundant sequences?
>> Thanks in advance,
>> Fred
>>
>> ### CODE:
>>    def redundancy(fastafile):
>>    f=open(fastafile, 'r')
>>    record = list(SeqIO.parse(f,"fasta"))
>>    new_rec = record
>>    f.close
>>    print len(record)
>>    for i in range(len(record)):
>>        for j in range(len(record)):
>>            if i < j:
>>                if record[i].seq == record[j].seq:
>>                    del new_rec[j]
>>     print len(new_rec)
>>
>>
>> ### RESULTS:
>> $ redundancy.py -run all_emm_fake.fasta
>> 823
>> /usr/lib64/python2.7/site-**packages/Bio/Seq.py:197: FutureWarning: In
>> future comparing Seq objects will use string comparison (not object
>> comparison). Incompatible alphabets will trigger a warning (not an
>> exception). In the interim please use id(seq1)==id(seq2) or
>> str(seq1)==str(seq2) to make your code explicit and to avoid this warning.
>>  "and to avoid this warning.", FutureWarning)
>> 823
>>
>> ### EXPECTING:
>> Worse, the function above is not working. I was expecting 823 before and
>> 822 after running it.
>>
>>
>>
>>
>> ______________________________**_________________
>> Biopython mailing list  -  Biopython at lists.open-bio.org
>> http://lists.open-bio.org/**mailman/listinfo/biopython<http://lists.open-bio.org/mailman/listinfo/biopython>
>>
>
>
>
> --
> Iddo Friedberg
> http://iddo-friedberg.net/contact.html
> ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
> .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
>>>----.<--.>++++++.<<<<------------------------------------.
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython