[Biopython] fastq manipulations speed

Sun Mar 17 20:04:08 UTC 2013

Hi biopython list,

I have a few fasta files that come from processing fastq illumina reads for quality, polyAs adaptors etc, but i need to get their associated qualities back. I wrote a simple script that calls the following 2 funbctions, which I think are the fastest way to deal with fastq-fasta files in Biopython, but the script is awfully slow: 
For example, for one of my files, after 41h of run, only 28000 records out of 28 million have been processed. My files contain between 28-40 million reads, so i need to somehow make it faster if this is possible. Any ideas or any things you might see in the code that make iot so slow?

def makeDictofFasta_withLgth(fastafile):
    '''tested on Illumina IIx files after my cleaning routine, ie sequences used in velvet'''
    mydict={}
    info= SeqIO.index(fastafile, "fasta")
    for rec in info.keys():
        mydict[rec]=len(info[rec].seq)
    print 'finished making dictionary of fasta records'
    return mydict

def Addquals_inTrimmedFastA(fastq, newfastq, fasta):
    outfastq=open(newfastq, "w")
    fasta_dict=makeDictofFasta_withLgth(fasta)
    fq_dict = SeqIO.index(fastq,"fastq-illumina")

    for record in fasta_dict.keys():
        if record in fq_dict.keys():
            length= fasta_dict[record]
            sub_rec = fq_dict[record][0:length]
            outfastq.write(sub_rec.format('fastq-illumina'))
    outfastq.close()

Thanks in advance, 

Natassa