[Biopython] fastq manipulations speed

natassa natassa_g_2000 at yahoo.com
Sun Mar 17 23:02:12 UTC 2013


Thanks Peter,
I use Python quite often, but I had missed the fact that I don't need the keys method (!), and I have always used dictionaries instead of sets. I am still not clear on that point, though: is using a set faster in general?
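From what I have read, membership testing is O(1) for both a dict and a set, so the real slow pattern was calling keys(). A tiny sketch with made-up IDs (the costs in the comments are standard Python 2 behaviour):

length_dict = {"read1": 75, "read2": 68}    # made-up example data
id_set = set(length_dict)                   # the same IDs as a set

print "read1" in length_dict          # O(1) hash lookup on the dict
print "read1" in id_set               # also O(1); no faster than the dict
print "read1" in length_dict.keys()   # builds a list, then scans it: O(n)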
I have adapted the script according to your suggestions, see below. I am also not sure about the format point you raised: if the format method calls SeqIO.write internally, why is calling it slower than calling SeqIO.write directly?


from Bio import SeqIO

def Addquals_inTrimmedFastA(fastq, newfastq, fasta):
    outfastq = open(newfastq, "w")
    # Map each trimmed FASTA record's ID to its length, in a single pass.
    length_dict = dict((rec.id, len(rec)) for rec in
                       SeqIO.parse(fasta, "fasta"))

    # One sequential pass over the FASTQ, trimming each record that
    # survived the cleaning down to its trimmed length.
    for record in SeqIO.parse(fastq, "fastq-illumina"):
        if record.id in length_dict:
            length = length_dict[record.id]
            # [0:length] keeps bases 0..length-1, i.e. exactly length bases
            sub_rec = record[0:length]
            SeqIO.write(sub_rec, outfastq, 'fastq-illumina')
    outfastq.close()
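
For illustration, it would be called like this (the file names here are just placeholders):

Addquals_inTrimmedFastA("reads.fastq", "reads_with_quals.fastq",
                        "trimmed_reads.fasta")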


Will let you know if this is faster. I really appreciate the advice, which I need to improve my programming skills! I definitely have a tendency to make things more complicated than they are.

As for the trimming, yes, it was done only on the end, based on thresholds I obtained from plots of the fastq files using the fastx toolkit. I then applied those thresholds in a hard-coded way to all my files, and proceeded with further trimming of polyAs and NNs in the resulting fastas. I know it would have been better to keep the qualities, and that this trimming method is not the most common, but it was done a long time ago, when I was unsure about many things, and the routine made sense to me (I still think there is nothing wrong with it). I also knew I would not need the qualities, since the assembler I used does not take them into account, so I thought that dragging along unnecessary information would just be a pain, especially as the files are also bigger.
Thanks, 

Natassa


________________________________
 From: Peter Cock <p.j.a.cock at googlemail.com>
To: natassa <natassa_g_2000 at yahoo.com> 
Sent: Sunday, March 17, 2013 2:38 PM
Subject: Re: [Biopython] fastq manipulations speed
 
On Sun, Mar 17, 2013 at 8:04 PM, natassa <natassa_g_2000 at yahoo.com> wrote:
> Hi biopython list,
>
> I have a few fasta files that come from processing fastq illumina reads
> for quality, polyAs, adaptors etc., but I need to get their associated
> qualities back. I wrote a simple script that calls the following 2
> functions, which I think are the fastest way to deal with fastq/fasta
> files in Biopython, but the script is awfully slow:

You're using some relatively sophisticated bits of SeqIO,
but you've made it more complicated than it needs to be.

> For example, for one of my files, after 41h of running, only 28,000 records
> out of 28 million have been processed. My files contain between 28-40
> million reads, so I need to somehow make this faster if possible. Any ideas,
> or anything you can see in the code that makes it so slow?
>
> def makeDictofFasta_withLgth(fastafile):
>     '''tested on Illumina IIx files after my cleaning routine, ie
> sequences used in velvet'''
>     mydict={}
>     info= SeqIO.index(fastafile, "fasta")
>     for rec in info.keys():
>         mydict[rec]=len(info[rec].seq)
>     print 'finished making dictionary of fasta records'
>     return mydict

In this example you don't need the index - all you need to do is
one loop over the file while building up the dictionary of lengths.
e.g.

length_dict = {}
for rec in SeqIO.parse(fastafile, "fasta"):
    length_dict[rec.id] = len(rec)

Or more elegantly using a generator expression:

length_dict = dict((rec.id, len(rec)) for rec in
                   SeqIO.parse(fastafile, "fasta"))

By using an index like this, and looping over the reads in
whatever order the dictionary uses (based on the hashing
algorithm), you are doing a lot of wasteful disk access,
jumping back and forth in the FASTA file. I think this is
the main source of slowness in your script.
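
Roughly, the two access patterns look like this (a sketch,
with a stand-in filename):

from Bio import SeqIO

# Indexed access: each lookup seeks to that record's offset on disk,
# so visiting the keys in arbitrary order means jumping around the file.
fasta_index = SeqIO.index("example.fasta", "fasta")
for key in fasta_index.keys():
    rec = fasta_index[key]    # one disk seek (plus parse) per record

# Sequential access: one pass, records arrive in file order.
for rec in SeqIO.parse("example.fasta", "fasta"):
    pass                      # do something with rec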

> def Addquals_inTrimmedFastA(fastq, newfastq, fasta):
>     outfastq=open(newfastq, "w")
>     fasta_dict=makeDictofFasta_withLgth(fasta)
>     fq_dict = SeqIO.index(fastq,"fastq-illumina")
>
>     for record in fasta_dict.keys():
>         if record in fq_dict.keys():
>             length= fasta_dict[record]
>             sub_rec = fq_dict[record][0:length]
>             outfastq.write(sub_rec.format('fastq-illumina'))
>     outfastq.close()

Again, you don't need the index. Doing this you'll process
the reads in the arbitrary order Python uses for the
dictionary (essentially random order), meaning lots and
lots of wasted and slow disk access as the Biopython
indexing jumps to the records in the FASTQ file in that
arbitrary order.

Just make a single loop over the file with SeqIO.parse.

Also don't use the SeqRecord's format method like that -
the help text tries to direct you to the SeqIO.write method
which will be faster (the format method calls this internally).
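
In other words, roughly:

# Slower: format() builds the whole record as a string in memory
# (internally via SeqIO.write on a StringIO handle), then you write it:
outfastq.write(sub_rec.format('fastq-illumina'))

# Faster: send the record straight to the output handle:
SeqIO.write(sub_rec, outfastq, 'fastq-illumina')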

With those changes you should get far more sensible
run times - but there is still a lot of room for improvement.
Do you want to try out the suggestions so far, and then
we can make a second round of feedback? That would
be my recommendation if you are hoping to improve
your Python skills.
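
As a taste of one such improvement (just a sketch, and the
function name is made up): SeqIO.write also accepts an
iterator of records and a filename, so the whole job can be
a single generator feeding one write call:

from Bio import SeqIO

def add_quals_sketch(fastq, newfastq, fasta):
    # ID -> trimmed length, built in one pass over the FASTA.
    lengths = dict((rec.id, len(rec)) for rec in
                   SeqIO.parse(fasta, "fasta"))
    # Lazily trimmed FASTQ records, consumed by a single write call.
    trimmed = (rec[:lengths[rec.id]]
               for rec in SeqIO.parse(fastq, "fastq-illumina")
               if rec.id in lengths)
    count = SeqIO.write(trimmed, newfastq, "fastq-illumina")
    print '%i records written' % count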

As an aside, are you *sure* your trimming pipeline has
only trimmed the ends of the sequences? That seems
to be the assumption you've made here - but if you
have any barcodes they'll be trimmed from the start
of the sequences.

In general it would be far better to do the trimming on
the FASTQ files, rather than on the FASTA files and
then trying to fix the qualities. As Chris pointed out,
there are existing, well-tested quality control/trimming
libraries which might be worth checking out.

I hope that is useful,

Peter
