[Biopython] Adaptor trimmer and dimers

Wed Oct 21 06:18:09 EDT 2009

On Wed, Oct 21, 2009 at 10:54 AM, natassa <natassa_g_2000 at yahoo.com> wrote:
>
> My main problem now is performance of this script: On a file of
> 19 million reads of 76 bp it is running for more than 12 hours!
> So I copy here my code and would be very grateful if someone
> could indicate parts where it could be sped up.

The best way to answer that is to run some profiling yourself.
I would just make a small test file, and profile that.
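
As a rough sketch of what I mean, something like the following uses the
standard library's cProfile and pstats modules. The trim_adaptor and
process_reads functions and the adaptor sequence are hypothetical
stand-ins for the per-read work in your script, not your actual code:

```python
import cProfile
import io
import pstats


def trim_adaptor(seq, adaptor="AGATCGGAAG"):
    """Hypothetical stand-in for the per-read work in the real script."""
    index = seq.find(adaptor)
    return seq if index == -1 else seq[:index]


def process_reads(reads):
    """Apply the trimming step to every read (plain strings here)."""
    return [trim_adaptor(seq) for seq in reads]


# Small synthetic test set; in practice, take the first few thousand
# reads from the real FASTQ file instead.
reads = ["ACGT" * 10 + "AGATCGGAAG" + "TTTT" * 6] * 10000

profiler = cProfile.Profile()
profiler.enable()
trimmed = process_reads(reads)
profiler.disable()

# Report the ten most expensive calls, sorted by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

The functions at the top of the report are the ones worth optimising
first. For a whole script, "python -m cProfile -s cumulative script.py"
gives a similar table without editing the code.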

> I am not very good in python for sure, but I am also pretty sure
> this is not an endless loop problem and I have run out of ideas
> how to make it faster (unless I abandon working with Seq Records).
> I am seriously thinking of inputting Fastas instead of Fastq-illumina
> files, but for a whole bunch of tests I am running now, being
> able to work with Fastq would be ideal...

You are using Bio.SeqIO to parse the FASTQ files, but you don't
use the quality scores at all. Therefore it would be faster to use
FASTA files, or to keep working with FASTQ files but switch from
SeqRecord objects to simple strings, as described here:

http://lists.open-bio.org/pipermail/biopython/2009-August/005430.html
http://news.open-bio.org/news/2009/09/biopython-fast-fastq/
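
To illustrate the idea of working with plain strings, here is a minimal
standard-library sketch. The iter_fastq_strings function is a
hypothetical helper, and it assumes strict four-line FASTQ records with
no line wrapping. Biopython's FastqGeneralIterator (in
Bio.SeqIO.QualityIO) yields the same kind of (title, sequence, quality)
string tuples and copes with wrapped records, so prefer that on real
data:

```python
import io


def iter_fastq_strings(handle):
    """Yield (title, sequence, quality) tuples of plain strings.

    Assumption: every record is exactly four lines (no line wrapping).
    """
    while True:
        title = handle.readline()
        if not title:
            break  # end of file
        if not title.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not title.startswith("@") or len(seq) != len(qual):
            raise ValueError("Malformed FASTQ record: %r" % title)
        yield title[1:].rstrip("\n"), seq, qual


# Tiny in-memory example standing in for a real FASTQ file handle.
example = io.StringIO(
    "@read1\nACGTAGATCGGAAG\n+\n" + "h" * 14 + "\n"
    "@read2\nTTTTGGGG\n+\n" + "h" * 8 + "\n"
)

records = list(iter_fastq_strings(example))
for title, seq, qual in records:
    print(title, len(seq))
```

No SeqRecord or Seq objects are created and the quality string is never
decoded into a list of integers, which is where most of the saving
should come from.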

Peter