[Biopython] replace header

Dilara Ally dilara.ally at gmail.com
Wed May 30 03:30:27 UTC 2012


Hi Guys, 

I'd like to replace just one part of the header for every read in a 40 GB FASTQ file. Because the files are so large, I don't want to read the entire file into memory; I'd rather read a single record, change its header, and write it straight to a new file. The problem as it stands is that I'm creating a new SeqRecord object for each read and appending it to a list called newsolid. Only once that list holds every record do I write it to a new file.

Preferably, I'd like to write each new SeqRecord to the output file immediately. Sorry if I've missed this lesson in the Biopython Tutorial and Cookbook! Any help would be greatly appreciated!

Here is the code. 

import re

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

# Pattern and replacement for the header: keep the first group, replace the second with "1"
subfind = r"(\w+)_(\w+)"
subreplace = r"\1_1"

newsolid = []  # every new record is collected here, so the whole file ends up in memory
for seq_record in SeqIO.parse("solid_1.fastq", "fastq"):
    original_header = seq_record.id
    print(original_header)
    result = re.search(subfind, original_header)
    print(result.groups())
    new_header = re.sub(subfind, subreplace, original_header)
    print(new_header)
    # Same sequence and qualities, new header; the empty description keeps
    # the header line down to just the new id
    newfastqrecord = SeqRecord(seq_record.seq, id=new_header, description="",
                               letter_annotations=seq_record.letter_annotations)
    newsolid.append(newfastqrecord)

output = "newsolid_1.fastq"
SeqIO.write(newsolid, output, "fastq")
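
For comparison, here is a minimal sketch of what writing each record immediately might look like. It assumes that SeqIO.write accepts any iterator of records, not only a list, so a generator can hand it one renamed record at a time. The names HEADER_PATTERN, renamed_records and rewrite_headers are made up for this illustration, and the regular expression is the same one used in the loop above.

```python
import re

from Bio import SeqIO

# Same pattern as in the loop above (hypothetical constant name).
HEADER_PATTERN = re.compile(r"(\w+)_(\w+)")


def renamed_records(records):
    """Yield each record with its id rewritten, one record at a time."""
    for record in records:
        record.id = HEADER_PATTERN.sub(r"\1_1", record.id)
        # Clear the description so the header line carries only the new id.
        record.description = ""
        yield record


def rewrite_headers(in_path, out_path):
    """Stream records from in_path to out_path; return the number written."""
    # SeqIO.parse yields records lazily and SeqIO.write consumes them as they
    # arrive, so no list of all records is ever built.
    records = SeqIO.parse(in_path, "fastq")
    return SeqIO.write(renamed_records(records), out_path, "fastq")
```

With the file names from above, the call would be rewrite_headers("solid_1.fastq", "newsolid_1.fastq"). Modifying the parsed record in place also avoids constructing a second SeqRecord per read.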


Cheers, Dilara



