[Biopython] Adaptor trimmer and dimers
natassa
natassa_g_2000 at yahoo.com
Wed Oct 21 05:54:26 EDT 2009
Brad,
Thank you for the tips. I adapted your code a bit to handle pairs (that is, I have both read1 and 2 of a pair in the same file and if I find the adaptor in any read of the pair, I discard the pair.) I also had to add an additional test for the length of the alignment output, as I got an index Error for the cases the adapter does not align at all. I am not sure i got this part right, I looked a bit at the related Biopython alignment code, and that is what I concluded.
My main problem now is performance of this script: On a file of 19 million reads of 76 bp it is running for more than 12 hours! So I copy here my code and would be very grateful if someone could indicate parts where it could be sped up. Also, Brad, could you check this extra test line in the handle_adaptor function?
I am not very good in python for sure, but I am also pretty sure this is not an endless loop problem and I have run out of ideas how to make it faster (unless I abandon working with Seq Records). I am seriously thinking of inputting Fastas instead of Fastq-illumina files, but for a whole bunch of tests I am running now, being able to work with Fastq would be ideal...
Hope this is just a silly mistake of mine..
Here is the code:
from Bio import SeqIO
import os
from Bio import pairwise2
from Bio.Seq import Seq
def handle_adaptor(record, adaptor, num_errors):
'''returns 1 if no adaptor found as exact match or as a a pairwise alignment allowing two errors. Otherwise: none'''
gap_char = '-'
exact_pos = str(record.seq).find(adaptor)
#exact match
if exact_pos >= 0:
seq_region = str(record.seq[exact_pos:exact_pos+len(adaptor)])
adapt_region = adaptor
else:
if len(pairwise2.align.localms(str(record.seq),str(adaptor), 5.0, -4.0, -9.0, -0.5, one_alignment_only=True, gap_char=gap_char)) ==0:
#no alignment at all
return 1
else:
if len(pairwise2.align.localms(str(record.seq),str(adaptor), 5.0, -4.0, -9.0, -0.5, one_alignment_only=True, gap_char=gap_char)) >=1:
seq_a, adaptor_a, score, start, end = pairwise2.align.localms(str(record.seq),str(adaptor), 5.0, -4.0, -9.0, -0.5, one_alignment_only=True,
gap_char=gap_char)[0]
adapt_region = adaptor_a[start:end]
seq_region = seq_a[start:end]
matches = sum((1 if s == adapt_region[i] else 0) for i, s in
enumerate(seq_region))
# too many errors --
if (len(adaptor) - matches) > num_errors:
return 1
def Handle_shuffledFiles (path, number_of_adaptor, num_errors):
all_files=os.listdir(path)
for file in all_files:
if not file.endswith('fastq'):
continue
else:
if '_afr_' in file :
print "working on : "+file + "..."
if number_of_adaptor==1:
adaptor='ACACTCTTTCCCTACACGACGCTCTTCCGATCT'
output=path+'Adaptor1'+'_removedNat/'+file+'_Clean.txt'
elif number_of_adaptor==2:
adaptor= 'TCTAGCCTTCTCGCCAAGTCGTCCTTACGGCTCTGGC'
output=path+'Adaptor2'+'_removedNat/'+file+'_Clean.txt'
out_handle=open(output, "w")
iter = SeqIO.parse(open(path+file), "fastq-illumina")
j=0
k=0
try:
while 1:
rec1 = iter.next()
rec2 = iter.next()
k=k+1
Ad_inR1 = handle_adaptor(rec1, adaptor, num_errors ) #returns 1 if no adaptor found or if found with >2 mismatches
Ad_inR2 = handle_adaptor(rec2, adaptor, num_errors )
if Ad_inR1 and Ad_inR2:
j=j+1
print 'Counting the %i th pair that has no adaptor ...' %j
SeqIO.write([rec1, rec2], out_handle, "fastq-illumina")
except StopIteration, e:
pass
out_handle.close()
print '..out of %i pairs total' %k
if __name__ == "__main__":
path2Fastq="/Users/nat/Data/Illumina/Restricted_forTests/Fastq-Illumina/shuffled/"
Handle_shuffledFiles(path2Fastq, 1, 2)
Thanks!
Anastasia
Post-Doc, Evolutionary Biology Department
Upssala University
Norbyvägen 18D
SE-752 36 UPPSALA
anastasia.gioti at ebc.uu.se
Tel: +46-18-471 2837
Fax: +46-18-471 6310
More information about the Biopython
mailing list