[Biopython] Parsing large blast files

Stefanie Lück lueck at ipk-gatersleben.de
Mon Apr 27 10:34:28 UTC 2009


Hi!

I want to blast many sequences against one DB and parse the outputs. At the moment, I do it in that way:

def blast_a_record(fasta_rec):
    open("to_blast.fasta", "w").write(str(fasta_rec))
    f = open('out.txt', 'a')
    my_blast_db = "\"G:\RNAiscan\\barleyv9\""
    my_blast_file = "G:\\RNAiscan\\to_blast.fasta" 
    my_blast_exe = "G:\\RNAiscan\\blastall.exe"
    result_handle, error_info = NCBIStandalone.blastall(my_blast_exe, "blastn", my_blast_db, my_blast_file)
    blast_results = result_handle.read()
    save_file = open("my_blast.xml", "w")
    save_file.write(blast_results)
    save_file.close()
    result = open("my_blast.xml")
    blast_records = NCBIXML.parse(result)
    for blast_record in blast_records:
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                # extract xml data from blast
                percent = float(100) * float(hsp.score) / float(len(fasta_rec))
                percent = round(percent, 0)
                if percent > 99.99:
                    primer_name = str(alignment.hit_def)
                    primer_length = str(alignment.length)
                    f.write(str(percent) + str(alignment.hit_def) + '\n')
    f.close()
def start_blast():
    handle = open("G:\RNAiscan\est.fasta", 'r')
    data = handle.readlines()
    for seq_record in data:
        rec = seq_record
        first_blast_hit = blast_a_record(rec)      
    handle.close()
start_blast()

This works but I think it's quite slow. I tried also the NCBIStandalone.Iterator() code from the tutrorial but I got the error message "Invalid header". 

Would NCBIStandalone.Iterator() be faster? Or, is there a way not to save a xml file or to save only the best hits (100 % match)? 

Kind regards
Stefanie



More information about the Biopython mailing list