[Biopython] large fasta files

Ivan Gregoretti ivangreg at gmail.com
Tue Sep 9 15:49:23 UTC 2014


I would use a similar strategy but instead of

ids_set = set()
...
ids_set.add(i.id)

I would create a set of sequences, so that duplicates are detected by
sequence content rather than by identifier:

seqs_set = set()
...
seqs_set.add(str(i.seq))
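Spelled out a little more, the idea could look like the sketch below. It is
only an illustration, not a tested recipe: the function names
unique_by_sequence and write_unique_fasta and the in_path/out_path arguments
are made up for this example, and it assumes that the first record seen for
each sequence is the one to keep and that the comparison is case-sensitive.

```python
def unique_by_sequence(records):
    """Yield each record whose sequence has not been seen before.

    Works on any iterable of objects with a .seq attribute, such as the
    SeqRecord objects produced by Bio.SeqIO.parse.
    """
    seqs_set = set()
    for i in records:
        key = str(i.seq)  # case-sensitive; use str(i.seq).upper() to ignore case
        if key not in seqs_set:
            seqs_set.add(key)
            yield i


def write_unique_fasta(in_path, out_path):
    """Stream a FASTA file, keeping the first record of each distinct sequence.

    in_path and out_path are hypothetical file names chosen by the caller.
    Returns the number of records written.
    """
    # Imported here so that the generator above has no dependencies.
    from Bio import SeqIO

    records = SeqIO.parse(in_path, "fasta")
    return SeqIO.write(unique_by_sequence(records), out_path, "fasta")
```

Because unique_by_sequence is a generator, records are written out as they
are read instead of being collected in a list. Only the set of distinct
sequence strings stays in memory, which can still be large for a big file.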


Cheers,

Ivan



Ivan Gregoretti, PhD



On Tue, Sep 9, 2014 at 9:30 AM, Jurgens de Bruin <debruinjj at gmail.com> wrote:
> Hi,
>
> Again, thanks; the example helped a lot. After running the script I noticed
> that the FASTA file contains a few duplicate sequences. What would be the
> best approach to remove the duplicates?
>
>
> On 9 September 2014 15:04, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>
>> On Tue, Sep 9, 2014 at 1:55 PM, Jurgens de Bruin <debruinjj at gmail.com>
>> wrote:
>> > Hi,
>> >
>> > So the IDs I am matching against are in a set.
>>
>> Good :)
>>
>> > if seq.id in lset_id:
>> >    list_seq.append(seq)
>>
>> This looks like you are building a list of SeqRecord objects in memory.
>> If you are looking for a large number of entries in the FASTA file, this
>> will consume a lot of RAM (and if you run out of RAM, things will suddenly
>> slow down as swap space is used instead).
>>
>> I would use a generator approach to write out the records you want
>> immediately, see the "Filtering a sequence file" example in the
>> Cookbook chapter of the Biopython Tutorial:
>>
>> http://biopython.org/DIST/docs/tutorial/Tutorial.html
>> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
>>
>> In your case, replace "sff" with "fasta" and adjust how the set of
>> wanted identifiers is loaded.
>>
>> Peter
>
>
>
>
> --
> Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
> distinti saluti/siong/duì yú/привет
>
> Jurgens de Bruin
>
> _______________________________________________
> Biopython mailing list  -  Biopython at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython


