[Biopython] matching headers and then writing the seq record
Dilara Ally
dilara.ally at gmail.com
Thu Jul 26 17:48:44 UTC 2012
Hi Everyone,
I'm interested in finding headers that match (in other words, paired reads) in two different FASTQ files. Once the common headers are found, I go back to each original FASTQ file and write its matched reads to a separate FASTQ file. Right now, the part of the code that runs really slowly is the headers_read1 and headers_read2 lines, and I was wondering if there is a more elegant and time-efficient way to do this than what I have done. It seems as if set undoes the elegance of using a generator. Any advice is greatly appreciated! Here is the code:

def get_header(seq_record):
    fields = seq_record.id.split(':')
    lastfield = fields[6].split('_')[0]
    return lastfield

def get_full_header(seq_record):
    fields = seq_record.id.split(':')
    headerInfo2 = fields[6].split('_')[0]
    # First six fields unchanged, seventh with any "_suffix" removed.
    headerInfo = ":".join(fields[:6] + [headerInfo2])
    return headerInfo

def replace_header(seq_record, pairType):
    if pairType == 1:
        ending = "/1"
    elif pairType == 2:
        ending = "/2"
    seq_record.id = seq_record.id + ending
    seq_record.name = ""
    seq_record.description = ""
    return seq_record

def matched_records(records, pairType, header_matches):
    for rec in records:
        # Named "header" rather than "id" to avoid shadowing the built-in.
        header = get_header(rec)
        if header in header_matches:
            newrec = replace_header(rec, pairType)
            yield newrec

import sys
from Bio import SeqIO

# Collect the headers from each file, then keep those present in both.
headers_read1 = set(get_header(seq_record) for seq_record in SeqIO.parse(sys.argv[1], "fastq"))
headers_read2 = set(get_header(seq_record) for seq_record in SeqIO.parse(sys.argv[2], "fastq"))
header_matches = [x for x in headers_read1 if x in headers_read2]

# Second pass over read 1: write the matched records.
records = SeqIO.parse(sys.argv[1], "fastq")
pairType = 1
count = SeqIO.write(matched_records(records, pairType, header_matches), sys.argv[3], "fastq")
print("Saved %i matched reads." % count)

# Second pass over read 2: write the matched records.
records = SeqIO.parse(sys.argv[2], "fastq")
pairType = 2
count = SeqIO.write(matched_records(records, pairType, header_matches), sys.argv[4], "fastq")
print("Saved %i matched reads." % count)
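
One possible direction for a speed-up, offered only as a sketch under assumptions: because header_matches is a list, every membership test inside matched_records is a linear scan, so keeping the matches as a set (a set intersection of the two header sets) may help more than changing how the sets are built. The example below is plain Python, not Biopython. The helpers read_fastq, pair_key, common_keys and write_matched are hypothetical names introduced here; read_fastq assumes strict four-line FASTQ records, and pair_key assumes the key is the first seven colon-separated fields with any "_suffix" removed from the seventh, as in get_full_header. In Biopython itself, FastqGeneralIterator from Bio.SeqIO.QualityIO yields plain (title, sequence, quality) strings in a similar way and avoids building full SeqRecord objects.

```python
import io


def read_fastq(handle):
    """Yield (title, seq, qual) tuples; assumes strict 4-line FASTQ records."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            break  # end of file (or a blank line) stops the iteration
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield title[1:], seq, qual  # drop the leading '@'


def pair_key(title):
    """Hypothetical pairing key: seven ':' fields, '_suffix' stripped from the last."""
    fields = title.split()[0].split(":")
    return ":".join(fields[:6] + [fields[6].split("_")[0]])


def common_keys(handle1, handle2):
    """Return the set of keys found in both inputs (set intersection)."""
    keys1 = {pair_key(title) for title, _, _ in read_fastq(handle1)}
    keys2 = {pair_key(title) for title, _, _ in read_fastq(handle2)}
    return keys1 & keys2


def write_matched(handle_in, handle_out, keys, ending):
    """Write records whose key is in `keys`, appending `ending` to the ID."""
    count = 0
    for title, seq, qual in read_fastq(handle_in):
        if pair_key(title) in keys:  # average O(1) lookup because keys is a set
            record_id = title.split()[0] + ending
            handle_out.write("@%s\n%s\n+\n%s\n" % (record_id, seq, qual))
            count += 1
    return count
```

With real files, the two inputs would be opened once for common_keys and then reopened (or rewound with seek(0)) for each write_matched call, passing "/1" and "/2" as the ending.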