[Biopython] matching headers and then writing the seq record

Dilara Ally dilara.ally at gmail.com
Thu Jul 26 17:48:44 UTC 2012


Hi Everyone, 

I'm interested in finding headers that match (in other words paired reads) in two different fastq files.  Once the common headers are found, I then go back to the original fastq file and write those matched reads to a different fastq file.  Right now, the part of the code that runs really slow is the headers_read1 and headers_read2 lines.  And I was wondering if there was a more elegant way and time efficient manner than what I have done.  It seems as if set undoes the elegance of using a generator.  Any advice is greatly appreciated!   Here is the code:

def get_header(seq_record):
    fields = seq_record.id.split(':')
    lastfield = fields[6].split('_')[0]
    return lastfield

def get_full_header(seq_record):
    fields = seq_record.id.split(':')
    headerInfo2 = fields[6].split('_')[0]
    headerInfo =  str(fields[0]) + ":" + str(fields[1]) + ":" + str(fields[2]) + ":" + str(fields[3]) + ":" + str(fields[4]) + ":" + str(fields[5]) + ":" + str(headerInfo2)
    return headerInfo 

def replace_header(seq_record,pairType):
    if pairType == 1:
        ending = "/1"
    elif pairType == 2:
        ending = "/2"
    seq_record.id=seq_record.id+ending
    seq_record.name = ""
    seq_record.description = ""
    return seq_record

def matched_records(records, pairType, header_matches):
    for rec in records:
        id = get_header(rec)
        result = id in header_matches
        #print result
        if (result == True):
            newrec = replace_header(rec,pairType)
            yield newrec

import sys
from Bio import SeqIO

headers_read1 = set(get_header(seq_record) for seq_record in SeqIO.parse(sys.argv[1], "fastq"))
headers_read2 = set(get_header(seq_record) for seq_record in SeqIO.parse(sys.argv[2], "fastq"))
header_matches = [x for x in headers_read1 if x in headers_read2]

records = SeqIO.parse(sys.argv[1], "fastq") 
pairType = 1
count = SeqIO.write(matched_records(records,pairType,header_matches), sys.argv[3], "fastq")
print "Saved %i matched reads." %count

records = SeqIO.parse(sys.argv[2], "fastq") 
pairType = 2
count = SeqIO.write(matched_records(records,pairType,header_matches), sys.argv[4], "fastq")
print "Saved %i matched reads." %count



More information about the Biopython mailing list