[Biopython] large fasta files

Jurgens de Bruin debruinjj at gmail.com
Tue Sep 9 13:30:52 UTC 2014


Hi,

Thanks again, the example helped a lot. After running the script I
noticed that the fasta file has a few duplicate sequences in it. What
would be the best approach to remove the duplicates?
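
One idea I had is to keep a set of the ids already written and skip any
record whose id has been seen before. Roughly something like this (only a
sketch - the file names are placeholders and I am assuming the duplicates
share the same record id):

from Bio import SeqIO

def unique_records(filename):
    # Yield records from a FASTA file, skipping any id already seen
    seen = set()
    for record in SeqIO.parse(filename, "fasta"):
        if record.id not in seen:
            seen.add(record.id)
            yield record

# "filtered.fasta" / "deduplicated.fasta" are placeholder names
count = SeqIO.write(unique_records("filtered.fasta"),
                    "deduplicated.fasta", "fasta")
print("Wrote %i unique records" % count)

Would that be a sensible way to do it, or is there a better approach?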


On 9 September 2014 15:04, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Tue, Sep 9, 2014 at 1:55 PM, Jurgens de Bruin <debruinjj at gmail.com>
> wrote:
> > Hi,
> >
> > So the ids I am matching against are in a set.
>
> Good :)
>
> > if seq.id in lset_id:
> >    list_seq.append(seq)
>
> This looks like you are building a list of SeqRecord objects in memory.
> If you are looking for a large number of entries in the FASTA file, this
> will consume a lot of RAM (and if you run out of RAM it will suddenly
> slow down as swap space is used instead).
>
> I would use a generator approach to write out the records you want
> immediately; see the "Filtering a sequence file" example in the
> Cookbook chapter of the Biopython Tutorial:
>
> http://biopython.org/DIST/docs/tutorial/Tutorial.html
> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
>
> In your case, replace "sff" with "fasta" and adjust how the set of
> wanted identifiers is loaded.
>
> Peter
>
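
For reference, adapting that Tutorial example to FASTA gives roughly the
following (just a sketch - the input/output file names and the idea of
loading the wanted ids from a plain text file, one identifier per line,
are my own assumptions):

from Bio import SeqIO

# Assumption: the wanted ids live in a text file, one identifier per line
with open("wanted_ids.txt") as handle:
    wanted = set(line.strip() for line in handle if line.strip())

# Generator expression, so records are written out as they are parsed
records = (r for r in SeqIO.parse("input.fasta", "fasta") if r.id in wanted)
count = SeqIO.write(records, "filtered.fasta", "fasta")
print("Saved %i records" % count)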



-- 
Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
distinti saluti/siong/duì yú/привет

Jurgens de Bruin