[Biopython] Parsing large blast files
Stefanie Lück
lueck at ipk-gatersleben.de
Mon Apr 27 06:34:28 EDT 2009
Hi!
I want to blast many sequences against one DB and parse the outputs. At the moment, I do it in that way:
def blast_a_record(fasta_rec):
open("to_blast.fasta", "w").write(str(fasta_rec))
f = open('out.txt', 'a')
my_blast_db = "\"G:\RNAiscan\\barleyv9\""
my_blast_file = "G:\\RNAiscan\\to_blast.fasta"
my_blast_exe = "G:\\RNAiscan\\blastall.exe"
result_handle, error_info = NCBIStandalone.blastall(my_blast_exe, "blastn", my_blast_db, my_blast_file)
blast_results = result_handle.read()
save_file = open("my_blast.xml", "w")
save_file.write(blast_results)
save_file.close()
result = open("my_blast.xml")
blast_records = NCBIXML.parse(result)
for blast_record in blast_records:
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
# extract xml data from blast
percent = float(100) * float(hsp.score) / float(len(fasta_rec))
percent = round(percent, 0)
if percent > 99.99:
primer_name = str(alignment.hit_def)
primer_length = str(alignment.length)
f.write(str(percent) + str(alignment.hit_def) + '\n')
f.close()
def start_blast():
handle = open("G:\RNAiscan\est.fasta", 'r')
data = handle.readlines()
for seq_record in data:
rec = seq_record
first_blast_hit = blast_a_record(rec)
handle.close()
start_blast()
This works but I think it's quite slow. I tried also the NCBIStandalone.Iterator() code from the tutrorial but I got the error message "Invalid header".
Would NCBIStandalone.Iterator() be faster? Or, is there a way not to save a xml file or to save only the best hits (100 % match)?
Kind regards
Stefanie
More information about the Biopython
mailing list