[Biopython] Filter Blast results

Wibowo Arindrarto w.arindrarto at gmail.com
Tue Feb 26 17:27:21 UTC 2013


Hi Justin,

For your purpose, you can try using the SearchIO module
(http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc101), from
the latest Biopython (1.61).

Here's my attempt to have a similar working function:

from Bio import SearchIO, SeqIO

fasta_ids = set([x.id for x in SeqIO.parse('fasta', 'fasta')]) # get
all fasta IDs in a set

with open('no_hit', 'w') as no_hit, open('hit', 'w') as hit:
    for qresult in SearchIO.parse('blast_results.xml', 'blast-xml'):
        hits = set([x.id for x in qresult]) # get all the ID in a set
        present = fasta_ids.intersection(hits) # output all IDs
present in both sets

        if present: # set is not empty
            hit.write(qresult.id)
        else:
            no_hit.write(qresult.id)

On another note, if you are always checking against the same Fasta
file, you can try to create your own BLAST database consisting of only
those files and search against them, so any BLAST results you have
will always at least
contain one of the sequences in your FASTA file.

This makes the functions slightly simpler:

from Bio import SearchIO

with open('no_hit', 'w') as no_hit, open('hit', 'w') as hit:
    for qresult in SearchIO.parse('blast_results.xml', 'blast-xml'):
        # empty queries evaluate to False
        if qresult:
            hit.write(qresult.id)
        else:
            no_hit.write(qresult.id)

Both functions still require you to store all the FASTA IDs in memory,
but should be more reasonable than storing whole SeqRecord objects.

Hope that helps,
Bow

On Tue, Feb 26, 2013 at 5:57 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Feb 26, 2013 at 4:45 PM, Justin Gibbons <jgibbons1 at mail.usf.edu> wrote:
>> I know that there is already a script in the Cookbook for filtering out
>> blast queries with no hits, but it involves holding all of the sequence
>> objects in memory, which isn't good if you have to work with a lot of
>> sequences.
>
> Hi Justin,
>
> Which example are you referring too? It doesn't sound very efficient.
>
> There are some wiki pages with user contributed cookbook recipes:
> http://biopython.org/wiki/Category:Cookbook
>
> There is also the "Biopython Tutorial and Cookbook", online here:
> http://biopython.org/DIST/docs/tutorial/Tutorial.html
> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
>
> Thanks,
>
> Peter
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython



More information about the Biopython mailing list