[Biopython] Biopython local blastn query

Ara Kooser ghashsnaga at gmail.com
Tue Jul 30 01:45:55 UTC 2013


Hello all,

   I goofed up curating accession numbers for part of my PhD project,
but I have the sequences in a big fasta file! I wrote a quick script that
read in one sequence at a time from the file, blasted it, and then filtered
the hits on 0 gaps and a 100% identity match. I ran it on just the first 6
sequences so as not to anger NCBI. This worked great, but it's slow
(really slow) and I can't submit the whole file.

 I installed a local blast db and wrote this script (attached as
meta_data_local.py, along with the query file, clear_genus_level.fasta):

########################################################################################
# I want to read in one sequence at a time from a fasta file and blast it
# against a local blast db.

from Bio.Blast.Applications import NcbiblastnCommandline
from Bio.Blast import NCBIXML
from Bio import SeqIO

# Where the database is located
nt = "/Users/arakooser/blast/db/nt.00"

# Output file: contains all the data my boss wants on the sequences
file_out = open("metadata_genus.level.csv", "w+")

# The main fasta file that needs to be blasted
file_in = open("clear_genus_level.fasta")

# Parse the main fasta file, one record at a time
fas_rec = SeqIO.parse(file_in, "fasta")

for first_seq in fas_rec:
    # Take one sequence, submit it on stdin to the blastn command line,
    # and write the result out as XML.
    result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001,
                                   outfmt=5, out="temp.xml")
    stdout, stderr = result(stdin=first_seq.format("fasta"))

    # Read the XML file back in.
    record = open("temp.xml")
    blast_record = NCBIXML.read(record)

    for alignment in blast_record.alignments:
        # Something goes wrong here. This part should only let one sequence
        # per query through, but they all come through.
        # When I run this same setup without the local database it works fine???
        for hsp in alignment.hsps:
            percent_id = (100 * hsp.identities) / hsp.align_length
            if hsp.gaps == 0 and percent_id == 100:
                title_element = alignment.title.split()

                print title_element[1] + " " + title_element[2] + ", " + \
                      alignment.accession + ", " + str(alignment.length) + ", " + \
                      str(hsp.gaps) + ", " + str(hsp.identities) + " " + str(percent_id)

                file_out.write(title_element[1] + " " + title_element[2] + ", " +
                               alignment.accession + ", " + str(alignment.length) +
                               ", " + hsp.sbjct + "\n")

It works, kind of.

*What I thought I did:*
Grab a single sequence from the fasta file
Blast
Grab the xml and then filter based on gaps and percent id
Write stuff to file
Repeat
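To be concrete about the filter step, here is what I intend it to do, written out as a standalone check (just a sketch; the function name is mine, and the three values come from each hit's hsp.identities, hsp.align_length, and hsp.gaps):

```python
def is_perfect_hit(identities, align_length, gaps):
    """True only for an ungapped alignment with 100% identity."""
    if gaps != 0:
        return False
    # Integer percent identity, as in the script above: only a
    # full-length exact match reaches 100.
    percent_id = (100 * identities) // align_length
    return percent_id == 100

print(is_perfect_hit(250, 250, 0))  # exact full-length match -> True
print(is_perfect_hit(249, 250, 0))  # one mismatch -> False
```

So anything with even a single mismatch or gap should be dropped, and only exact full-length matches should reach the output file.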

*What is happening (I think):*
Grab a single sequence from the fasta file
Blast
Grab the xml
Write stuff to file
Repeat


Is there a difference between the XML files from NCBI and those from a local
blast install, in terms of how Biopython sees them?

Can anyone give me some pointers on how to solve this (did I goof up the
loop, or how it iterates over the sequences)?

Is this the best way to go about solving this problem (local vs NCBI web)?
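For context on that last question, the alternative I'm weighing is a single local blastn run over the whole fasta file, then iterating the per-query records in the one output XML with NCBIXML.parse. A rough sketch of the command, using subprocess directly (paths are from my setup, and blastn is assumed to be on PATH; this is not the script I actually ran):

```python
import subprocess

def blastn_command(query_path, db_path, out_path, evalue=0.001):
    # One blastn invocation over the whole multi-fasta query file:
    # avoids per-sequence process startup and a temp.xml per query.
    return ["blastn", "-query", query_path, "-db", db_path,
            "-evalue", str(evalue), "-outfmt", "5", "-out", out_path]

cmd = blastn_command("clear_genus_level.fasta",
                     "/Users/arakooser/blast/db/nt.00", "all_hits.xml")
# subprocess.check_call(cmd)
# ...then NCBIXML.parse(open("all_hits.xml")) yields one record per query.
print(cmd[0])  # -> blastn
```

If that's the better pattern, the filtering loop would stay the same but move inside a `for blast_record in NCBIXML.parse(...)` loop.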


Thank you!
ara






-- 
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
sub cardine glacialis ursae.

Geoscience website: http://www.tattooedscience.org/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: meta_data_local.py
Type: application/octet-stream
Size: 2123 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20130729/d09b32da/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: clear_genus_level.fasta
Type: application/octet-stream
Size: 8971 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20130729/d09b32da/attachment-0005.obj>

