[Biopython] Filter Blast results

Justin Gibbons jgibbons1 at mail.usf.edu
Tue Feb 26 16:45:03 UTC 2013


I know that there is already a script in the Cookbook for filtering out
blast queries with no hits, but it involves holding all of the sequence
objects in memory, which isn't good if you have to work with a lot of
sequences. I came up with the following function, which works, but I would
appreciate any input for how to improve it. In particular I don't like that
I am appending the sequence objects to file and would like to know of any
alternatives.

The main function is:

def filter_no_hits(blast_xml_results, source_fasta, file_format,
no_hit_file, hit_file):
    """Scans Blast XML results and if the query sequence has no hits prints
the sequence
        record in the no_hit_file, otherwise in the hit_file. The
source_fasta is the file
        that was used to perform the blast search and is used to retrieve
the sequence record"""

    result_handle=open(blast_xml_results) #open the xml file
    blast_records=NCBIXML.parse(result_handle) #create the generator object
    indexed_fasta=create_indexed_fasta(source_fasta, file_format) #create
the indexed file object

    for record in blast_records:
        hit_def_list=blast_xml_hit_def(record) #returns list of hit_def
results
        record_id=get_id_str_from_desc(record.query) #get the record ID to
search the indexed file
        record_object=indexed_fasta.get_raw(record_id) #Use the sequence ID
to get the sequence record

        if is_list_null(hit_def_list): #if no hits
            append_to_file(no_hit_file, record_object)
        else: #if hits
            append_to_file(hit_file, record_object)
    result_handle.close()

Sub-functions:

def create_indexed_fasta(path, file_format):
    """Makes a fasta file searchable like a dictionary with the sequence Id
    as the key"""
    return SeqIO.index(path, file_format)

def blast_xml_hit_def(record):
    """Returns a list of hit_def for a record from a NCBI blast XML
report"""
    hit_def_list=[]
    for alignment in record.alignments:
        hit_def_list.append(alignment.hit_def)
    return hit_def_list

def get_id_str_from_desc(desc):
    """Returns the Id from a fasta record description"""
    parts=desc.split(" ")
    return parts[0]

def is_list_null(lst):
    """Returns True if list is empty and False otherwise"""
    if len(lst)==0:
        return True
    else:
        return False

def append_to_file(path, string):
    with open(path, 'a') as f:
        f.write(string)

def record_counter(path, file_format):
    """Input a file path and the format of the file and it returns the
    number of records in the file"""
    counter=0
    for seq_record in SeqIO.parse(path, file_format):
        counter+=1
    print "%s contains %i records" %(path, counter)
    return counter

Thank you

Justin Gibbons



More information about the Biopython mailing list