[Biopython] large fasta files

Peter Cock p.j.a.cock at googlemail.com
Tue Sep 9 09:38:00 UTC 2014


On Tue, Sep 9, 2014 at 8:54 AM, Jurgens de Bruin <debruinjj at gmail.com> wrote:
> Hi All,
>
> I would like some advice on iterating over large fasta files 208MB  total of
> 1813132 sequences.  Currently using SeqIO.parse but seems very very slow. I
> would appreciate any help on this matter.

Do you need to look at each record one-by-one? If so, iterating over
the file in one pass is best, and if Bio.SeqIO.parse(..., "fasta") is too
slow then I suggest using Bio.SeqIO.FastaIO.SimpleFastaParser(...)
which just returns tuples of strings (avoiding the memory and speed
overhead of creating SeqRecord objects).
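
As a rough illustration of the SimpleFastaParser approach, here is a minimal
sketch that makes one pass over the file and only touches plain strings. The
function name count_and_total_length and the idea of counting records and
residues are my own placeholders, not something from your script; swap in
whatever per-record work you actually need.

```python
from Bio.SeqIO.FastaIO import SimpleFastaParser


def count_and_total_length(fasta_path):
    """Return (number of records, total sequence length) in a single pass.

    Hypothetical helper for illustration only.
    """
    count = 0
    total = 0
    with open(fasta_path) as handle:
        # Each item is a (title, sequence) tuple of plain Python strings:
        # the title is the header line without the leading ">", and the
        # sequence has its line breaks removed. No SeqRecord is created.
        for title, seq in SimpleFastaParser(handle):
            count += 1
            total += len(seq)
    return count, total
```

If you need the record identifier, it is conventionally the first
whitespace-separated word of the title, e.g. title.split(None, 1)[0].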

Alternatively, it might be more efficient to jump to specific records
of interest using Bio.SeqIO.index(...) or Bio.SeqIO.index_db(...).
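
For the random-access route, something along these lines might do, assuming
you already know the identifiers you want. The helper name fetch_records and
the wanted_ids argument are hypothetical, just for the sketch.

```python
from Bio import SeqIO


def fetch_records(fasta_path, wanted_ids):
    """Return SeqRecord objects for the requested ids that exist in the file.

    Hypothetical helper for illustration only.
    """
    # SeqIO.index scans the file once to record where each record starts,
    # then parses a record only when it is looked up by id.
    index = SeqIO.index(fasta_path, "fasta")
    try:
        return [index[rec_id] for rec_id in wanted_ids if rec_id in index]
    finally:
        index.close()

# To reuse the index between runs, SeqIO.index_db takes an index filename
# first and stores the offsets in an SQLite file, for example:
#   index = SeqIO.index_db("seqs.idx", fasta_path, "fasta")
# ("seqs.idx" is a placeholder name.)
```

This only pays off if you want a subset of the records; if you need to visit
all of them, a single pass over the file is still the better choice.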

Peter

