[Biopython] large fasta files

Ivan Gregoretti ivangreg at gmail.com
Tue Sep 9 15:20:46 UTC 2014


Hello Jurgens and Peter,

I use this strategy and it is extremely fast:

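from Bio import SeqIO

ids_set = set()

# SeqIO.parse yields one record at a time, so only the IDs stay in memory
with open('file_name.fa', 'r') as file_handle:
    for record in SeqIO.parse(file_handle, 'fasta'):
        ids_set.add(record.id)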
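
To illustrate the streaming idea Peter describes in the quoted message (write each wanted record out immediately instead of appending it to a list), here is a minimal sketch in plain Python with no Biopython dependency. The helper names iter_fasta, record_id and filter_fasta and the demo data are made up for illustration. It assumes the ID is the first whitespace-delimited word after ">", which is how Biopython sets the record id for FASTA. With Biopython itself, the same pattern would be a generator expression over SeqIO.parse passed to SeqIO.write.

```python
import io


def iter_fasta(handle):
    """Yield (header_line, sequence_lines) for one FASTA record at a time."""
    header, lines = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line, []
        elif header is not None and line:
            lines.append(line)
    if header is not None:
        yield header, lines


def record_id(header):
    """Return the first word after '>' (assumed to be the record ID)."""
    parts = header[1:].split(None, 1)
    return parts[0] if parts else ""


def filter_fasta(in_handle, out_handle, wanted_ids):
    """Write records whose ID is in wanted_ids; return how many were kept.

    Only one record is held in memory at any time.
    """
    kept = 0
    for header, lines in iter_fasta(in_handle):
        if record_id(header) in wanted_ids:
            out_handle.write(header + "\n")
            for seq_line in lines:
                out_handle.write(seq_line + "\n")
            kept += 1
    return kept


# Hypothetical demo data; real use would pass open file handles instead.
demo_input = io.StringIO(">seq1 first\nACGT\nAC\n>seq2\nGGGG\n>seq3 third\nTTTT\n")
demo_output = io.StringIO()
kept = filter_fasta(demo_input, demo_output, {"seq1", "seq3"})
```

Because wanted_ids is a set, each membership test takes constant time on average, and memory use does not grow with the number of matching records.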


I hope this helps.

Ivan


Ivan Gregoretti, PhD

On Tue, Sep 9, 2014 at 9:11 AM, Jurgens de Bruin <debruinjj at gmail.com> wrote:
> Thanks for all the help, much appreciated!
>
>
> On 9 September 2014 15:04, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>
>> On Tue, Sep 9, 2014 at 1:55 PM, Jurgens de Bruin <debruinjj at gmail.com>
>> wrote:
>> > Hi,
>> >
>> > So the IDs I am matching against are in a set.
>>
>> Good :)
>>
>> > if seq.id in lset_id:
>> >    list_seq.append(seq)
>>
>> This looks like you are building a list of SeqRecord objects in memory.
>> If you are looking for a large number of entries in the FASTA file, this
>> will consume a lot of RAM (and if you run out of RAM, things will suddenly
>> slow down as swap space is used instead).
>>
>> I would use a generator approach to write out the records you want
>> immediately, see the "Filtering a sequence file" example in the
>> Cookbook chapter of the Biopython Tutorial:
>>
>> http://biopython.org/DIST/docs/tutorial/Tutorial.html
>> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
>>
>> In your case, replace "sff" with "fasta" and adjust how the set of
>> wanted identifiers is loaded.
>>
>> Peter
>
>
>
>
> --
> Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
> distinti saluti/siong/duì yú/привет
>
> Jurgens de Bruin