[Biopython] Writing large Fastq

Peter Cock p.j.a.cock at googlemail.com
Thu Aug 14 10:10:14 UTC 2014


Hi Jurgens,

The first problem is that you appear to be using a list of desired
identifiers. Python lists scale poorly here: membership checking is a
linear scan, so it is slow for large lists. Use a Python set instead,
which uses hashing and gives much faster membership tests.
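For example, a minimal sketch (reusing the ll_llist, ls_file and
ls_filetype names from your script below):

    from Bio import SeqIO

    # Convert the list to a set once; each "in" test is then roughly
    # constant time instead of a linear scan of the whole list.
    wanted_ids = set(ll_llist)

    fastq_parser = SeqIO.parse(ls_file, ls_filetype)
    wanted = (rec for rec in fastq_parser if rec.description in wanted_ids)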

Secondly, for a task like this you don't need to waste time decoding
the FASTQ quality scores into integers and back into strings, which is
what SeqIO.parse and SeqIO.write do on every record. See
http://news.open-bio.org/news/2009/09/biopython-fast-fastq/
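As a sketch using FastqGeneralIterator from that post (assuming your
wanted identifiers are full title lines, i.e. what rec.description
gives you):

    from Bio.SeqIO.QualityIO import FastqGeneralIterator

    wanted_ids = set(ll_llist)
    in_handle = open(ls_file)
    out_handle = open(ls_filename, "w")
    # FastqGeneralIterator yields (title, sequence, quality) as plain
    # strings, skipping SeqRecord creation and quality decoding:
    for title, seq, qual in FastqGeneralIterator(in_handle):
        if title in wanted_ids:
            out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    out_handle.close()
    in_handle.close()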

Peter

On Thu, Aug 14, 2014 at 8:03 AM, Jurgens de Bruin <debruinjj at gmail.com> wrote:
> Hi All,
>
> I would appreciate any help with the following: I have a script that
> filters reads from FASTQ files of about 1.5 GB in size. I want to write
> the filtered reads to a new FASTQ file, and this is where I seem to
> have a bug, as the writing of the file never finishes. I left the
> script running for 4 hours with no result, so I stopped it. This is
> what I currently have:
>
> from Bio import SeqIO
> fastq_parser = SeqIO.parse(ls_file,ls_filetype)
> wanted = (rec for rec in fastq_parser if rec.description in ll_llist )
> ls_filename = "%s_filered.fastq"%ls_file.split(".")[0]
> handle = open(ls_filename,'wb')
> SeqIO.write(wanted, handle , "fastq")
> handle.close()
>
> Thanks in advance
> --
> Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
> distinti saluti/siong/duì yú/привет
>
> Jurgens de Bruin
>
> _______________________________________________
> Biopython mailing list  -  Biopython at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython


