[Biopython] remove list redundancy

Fri Mar 23 18:39:59 EDT 2012

Ferreira,

I just realized I missed one important line:

        if checksum not in checksum_container:
            unique_records.append(record)

should be:

        if checksum not in checksum_container:
            checksum_container.append(checksum)
           unique_records.append(record)

Basically what the method does is it only adds the sequence record to
the unique_records list only if its sequence checksum is not present
in checksum_container already.

And apologies for the double mass-post everyone.

Have a nice weekend,
---
Bow

On Fri, Mar 23, 2012 at 23:23, Wibowo Arindrarto <w.arindrarto at gmail.com> wrote:
> Hi Ferreira,
>
> As Iddo have mentioned, it's better to build a new list containing
> unique records instead. Here's my shot at a method like that:
>
>
> from Bio import SeqIO
> from Bio.SeqUtils.CheckSum import seguid
>
> # Returns a list containing unique SeqRecord list.
> def remove_redundant(fastafile):
>    records = SeqIO.parse(fastafile, 'fasta')
>
>    # new list container
>    unique_records = []
>    # unique sequence checksum container
>    checksum_container = []
>
>    for record in records:
>        checksum = seguid(record.seq)
>        if checksum not in checksum_container:
>            unique_records.append(record)
>
>    return unique_records
>
> I assume your Fasta file is not very big since you opted to load
> everything into memory in your initial script. If it were big, you
> could change the method into a generator to save memory, by writing
> this instead:
>
>    # previous lines...
>    for record in records:
>        checksum = seguid(record.seq)
>        if checksum not in checksum_container:
>            yield record
>
> And iterating over the function like so:
>
>    for unique_record in remove_redundant(fastafile):
>        # process the records here
>
>
> Hope that helps,
> ---
> Bow
>
>
> On Fri, Mar 23, 2012 at 23:19, Iddo Friedberg <idoerg at gmail.com> wrote:
>> Python assigns by reference, not by value. So you can have the following:
>>
>>>>> a=[1,2,3]
>>>>> b=a
>>>>> print b
>> [1, 2, 3]
>>>>> del b[1]
>>>>> print a
>> [1, 3]
>>>>>
>>
>> So if you remove an item from list b, it will remove it from a as well.
>> Which is why in your case, record and new_rec end up the same, since they
>> were the same to start off with.
>>
>> Furthermore, in your loop, you are changing the length of "record" which is
>> the target of a for loop. Never a good idea and yields unexpected results.
>> Finally, the index "j" you are using points to one thing in record, but
>> will point to another thing in new_rec.
>>
>> You can do an assignment by value using the copy module
>> new_rec=copy.copy(record)
>>
>> That will create a completely new copy of record in new_rec.
>>
>> That still won't solve the problem that you have in the shifting place "j"
>> points to in the loop though.
>>
>> I would suggest building a list of non-redundant sequences rather than
>> deleting from a list of redundant sequences.
>>
>>
>> HTH,
>>
>> Iddo
>>
>> On Fri, Mar 23, 2012 at 5:55 PM, <ferreirafm at usp.br> wrote:
>>
>>> Hi Biopy users,
>>> I have a mult-sequence fasta file which I've read as a list. Is there a
>>> clever way/method to remove redundant sequences?
>>> Thanks in advance,
>>> Fred
>>>
>>> ### CODE:
>>>    def redundancy(fastafile):
>>>    f=open(fastafile, 'r')
>>>    record = list(SeqIO.parse(f,"fasta"))
>>>    new_rec = record
>>>    f.close
>>>    print len(record)
>>>    for i in range(len(record)):
>>>        for j in range(len(record)):
>>>            if i < j:
>>>                if record[i].seq == record[j].seq:
>>>                    del new_rec[j]
>>>     print len(new_rec)
>>>
>>>
>>> ### RESULTS:
>>> $ redundancy.py -run all_emm_fake.fasta
>>> 823
>>> /usr/lib64/python2.7/site-**packages/Bio/Seq.py:197: FutureWarning: In
>>> future comparing Seq objects will use string comparison (not object
>>> comparison). Incompatible alphabets will trigger a warning (not an
>>> exception). In the interim please use id(seq1)==id(seq2) or
>>> str(seq1)==str(seq2) to make your code explicit and to avoid this warning.
>>>  "and to avoid this warning.", FutureWarning)
>>> 823
>>>
>>> ### EXPECTING:
>>> Worse, the function above is not working. I was expecting 823 before and
>>> 822 after running it.
>>>
>>>
>>>
>>>
>>> ______________________________**_________________
>>> Biopython mailing list  -  Biopython at lists.open-bio.org
>>> http://lists.open-bio.org/**mailman/listinfo/biopython<http://lists.open-bio.org/mailman/listinfo/biopython>
>>>
>>
>>
>>
>> --
>> Iddo Friedberg
>> http://iddo-friedberg.net/contact.html
>> ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
>> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
>> .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
>>>>----.<--.>++++++.<<<<------------------------------------.
>> _______________________________________________
>> Biopython mailing list  -  Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython