[Biopython] Adaptor trimmer and dimers

Wed Oct 21 09:54:26 UTC 2009

Brad, 
Thank you for the tips. I adapted your code a bit to handle pairs (that is, I have both read1 and 2 of a pair in the same file and if I find the adaptor in any read of the pair, I discard the pair.) I also had to add an additional test for the length of the alignment output, as I got an index Error for the cases the adapter does not align at all. I am not sure i got this part right, I looked a bit at the related Biopython alignment code, and that is what  I concluded. 
My main problem now is performance of this script: On a file of 19 million reads of 76 bp it is running for more than 12 hours! So I copy here my code and would be very grateful if someone could indicate parts where it could be sped up. Also, Brad, could you check this extra test line in the handle_adaptor function? 
I am not very good in python for sure, but I am also pretty sure this is not an endless loop problem and I have run out of ideas how to make it faster (unless I abandon working with Seq Records). I am seriously thinking of inputting Fastas instead of Fastq-illumina files, but for a whole bunch of tests I am running now, being able to work with Fastq would be ideal...
Hope this is just a silly mistake of mine..
Here is the code:

from Bio import SeqIO
import os
from Bio import pairwise2
from Bio.Seq import Seq

def handle_adaptor(record, adaptor, num_errors):
    '''returns 1 if no adaptor found as exact match or as a a pairwise alignment allowing two errors. Otherwise: none'''
    gap_char = '-'
    exact_pos = str(record.seq).find(adaptor)
    #exact match
    if exact_pos >= 0:
        seq_region = str(record.seq[exact_pos:exact_pos+len(adaptor)])
        adapt_region = adaptor
    else:
        if len(pairwise2.align.localms(str(record.seq),str(adaptor), 5.0, -4.0, -9.0, -0.5, one_alignment_only=True, gap_char=gap_char)) ==0:
#no alignment at all
           return 1
        else:

            if len(pairwise2.align.localms(str(record.seq),str(adaptor), 5.0, -4.0, -9.0, -0.5, one_alignment_only=True, gap_char=gap_char)) >=1:   
                seq_a, adaptor_a, score, start, end = pairwise2.align.localms(str(record.seq),str(adaptor), 5.0, -4.0, -9.0, -0.5, one_alignment_only=True,
                                                                              gap_char=gap_char)[0]

                adapt_region = adaptor_a[start:end]
                seq_region = seq_a[start:end]

    matches = sum((1 if s == adapt_region[i] else 0) for i, s in
                  enumerate(seq_region))

    # too many errors -- 
    if (len(adaptor) - matches) > num_errors:
                      return 1      

def Handle_shuffledFiles (path, number_of_adaptor, num_errors):
    all_files=os.listdir(path) 
    for file in all_files:
        if not file.endswith('fastq'):
            continue
        else:
            if '_afr_' in file :
                print "working on : "+file + "..." 

                if number_of_adaptor==1:
                    adaptor='ACACTCTTTCCCTACACGACGCTCTTCCGATCT'
                    output=path+'Adaptor1'+'_removedNat/'+file+'_Clean.txt'
                elif number_of_adaptor==2:
                    adaptor= 'TCTAGCCTTCTCGCCAAGTCGTCCTTACGGCTCTGGC'  
                    output=path+'Adaptor2'+'_removedNat/'+file+'_Clean.txt'

                out_handle=open(output, "w")

                iter = SeqIO.parse(open(path+file), "fastq-illumina")
                j=0
                k=0
                try:
                    while 1:
                        rec1 = iter.next()    
                        rec2 = iter.next()
                        k=k+1

                        Ad_inR1 = handle_adaptor(rec1, adaptor, num_errors  ) #returns 1 if no adaptor found or if found with >2 mismatches
                        Ad_inR2 = handle_adaptor(rec2, adaptor, num_errors  ) 

                        if Ad_inR1 and Ad_inR2:
                            j=j+1
                            print 'Counting the %i th pair that has no adaptor ...' %j 

                            SeqIO.write([rec1, rec2], out_handle, "fastq-illumina")

                except StopIteration, e:
                    pass

                out_handle.close() 
                print '..out of %i pairs total' %k   

if __name__ == "__main__":
    path2Fastq="/Users/nat/Data/Illumina/Restricted_forTests/Fastq-Illumina/shuffled/"
    Handle_shuffledFiles(path2Fastq, 1,  2) 

Thanks!
Anastasia 

Post-Doc, Evolutionary Biology Department
Upssala University
Norbyvägen 18D
SE-752 36  UPPSALA
anastasia.gioti at ebc.uu.se
Tel: +46-18-471 2837
Fax: +46-18-471 6310