[Biopython] large fasta files

Jurgens de Bruin debruinjj at gmail.com
Tue Sep 9 12:12:05 UTC 2014


Hi,

Thanks for the reply. I am trying out
Bio.SeqIO.FastaIO.SimpleFastaParser. What I want to achieve is to iterate
over the fasta file, pull out the sequences whose ids are in a predefined
list, and write these to a new fasta file.
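
A minimal sketch of that filtering step, assuming the id is the first
whitespace-separated word of each header line. The names filter_fasta,
wanted_ids and width are hypothetical, chosen only for illustration. The
small stand-in parser is used only when Biopython is not installed, so the
sketch still runs on its own; it is not part of Biopython.

```python
import io

try:
    from Bio.SeqIO.FastaIO import SimpleFastaParser
except ImportError:
    # Stand-in used only when Biopython is missing (an assumption made so
    # that this sketch runs anywhere). It yields (title, sequence) tuples.
    def SimpleFastaParser(handle):
        title, chunks = None, []
        for line in handle:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                if title is not None:
                    yield title, "".join(chunks)
                title, chunks = line[1:], []
            elif title is not None:
                chunks.append(line.strip())
        if title is not None:
            yield title, "".join(chunks)


def filter_fasta(in_handle, out_handle, wanted_ids, width=60):
    """Copy records whose id is in wanted_ids; return how many were written."""
    wanted = set(wanted_ids)  # set membership tests are fast for large lists
    count = 0
    for title, seq in SimpleFastaParser(in_handle):
        # title is the header line without ">"; take its first word as the id
        parts = title.split(None, 1)
        if not parts or parts[0] not in wanted:
            continue
        out_handle.write(">%s\n" % title)
        # wrap the sequence at a fixed line width
        for i in range(0, len(seq), width):
            out_handle.write(seq[i:i + width] + "\n")
        count += 1
    return count
```

With real files this would be called with two open handles, for example
one opened for reading on the large input and one opened for writing on
the new file. Because the file is read in a single pass and no SeqRecord
objects are built, memory use should stay roughly constant regardless of
the number of sequences.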



On 9 September 2014 11:38, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Tue, Sep 9, 2014 at 8:54 AM, Jurgens de Bruin <debruinjj at gmail.com>
> wrote:
> > Hi All,
> >
> > I would like some advice on iterating over large fasta files: 208MB
> > with a total of 1813132 sequences. I am currently using SeqIO.parse
> > but it seems very slow. I would appreciate any help on this matter.
>
> Do you need to look at each record one-by-one? If so, iterating over
> the file in one pass is best, and if Bio.SeqIO.parse(..., "fasta") is too
> slow then I suggest using Bio.SeqIO.FastaIO.SimpleFastaParser(...)
> which just returns tuples of strings (avoiding the memory and speed
> overhead of creating SeqRecord objects).
>
> Alternatively, it might be more efficient to jump to specific records
> of interest using Bio.SeqIO.index(..) or Bio.SeqIO.index_db(...).
>
> Peter
>



-- 
Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
distinti saluti/siong/duì yú/привет

Jurgens de Bruin