[Biopython] Biopython local blastn query
Ara Kooser
ghashsnaga at gmail.com
Tue Jul 30 01:45:55 UTC 2013
Hello all,
I goofed up on curating accession numbers for part of my PhD project.
But I have the sequences in a big fasta file! I wrote a quick script that
read in one sequence at a time from the file, blasted it and then filtered
it based on 0 gaps and 100% id match. I did this for just the first 6
sequences so as not to anger the NCBI. This worked great! But it's slow
(really slow) and I can't submit the whole file.
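For reference, the web version was roughly this (a simplified sketch from memory, not the exact script I ran):

from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

for rec in SeqIO.parse("clear_genus_level.fasta", "fasta"):
    #Submit one sequence at a time to the NCBI web blastn service
    handle = NCBIWWW.qblast("blastn", "nt", rec.format("fasta"))
    blast_record = NCBIXML.read(handle)
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            percent_id = (100*hsp.identities)/hsp.align_length
            #Keep only gapless hits with 100% identity
            if hsp.gaps == 0 and percent_id == 100:
                print rec.id, alignment.accession, alignment.title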
I installed a local blast db and wrote this script (attached as
meta_data_local.py and the query file, clear_genus_level.fasta):
########################################################################################
#I want to read in one sequence at a time from a fasta file and blast it
#against a local blast db.

from Bio.Blast.Applications import NcbiblastnCommandline
from Bio.Blast import NCBIXML
from Bio import SeqIO
from Bio import Seq
from Bio.SeqRecord import SeqRecord

nt = "/Users/arakooser/blast/db/nt.00"
#Where the database is located

file_out = open("metadata_genus.level.csv", "w+")
#Contains all the data my boss wants on the sequences

file_in = open("clear_genus_level.fasta")
#The main fasta file that needs to be blasted

fas_rec = SeqIO.parse(file_in, "fasta")
#Parses the main fasta file

for first_seq in fas_rec:
    #Hopefully grabs the first sequence.
    #Takes that sequence from standard in, submits it to the blast
    #command line and spits out an xml.
    result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001,
                                   outfmt=5, out="temp.xml")
    stdout, stderr = result(stdin=first_seq.format("fasta"))

    #Reading in the xml file.
    record = open("temp.xml")
    blast_record = NCBIXML.read(record)

    for alignment in blast_record.alignments:
        #Something goes wrong here. This part should only allow one
        #sequence per query to come through but they all do.
        #When I run this same setup without the local database it works fine???
        for hsp in alignment.hsps:
            percent_id = (100*hsp.identities)/hsp.align_length
            if hsp.gaps == 0 and percent_id == 100:
                title_element = alignment.title.split()
                print title_element[1]+" "+title_element[2]+","+" "+alignment.accession\
                    +","+" "+str(alignment.length)+","\
                    +" "+str(hsp.gaps)+","+" "+str(hsp.identities)+" "+str(percent_id)
                file_out.write(title_element[1]+" "+title_element[2]+","+" "\
                    +alignment.accession+","+" "+str(alignment.length)+","+\
                    " "+hsp.sbjct+"\n")
It works, kind of.
*What I thought I did:*
Grab a single sequence from the fasta file
Blast
Grab the xml and then filter based on gaps and percent id (the check is spelled out below)
Write stuff to file
Repeat
*What is happening (I think):*
Grab a single sequence from the fasta file
Blast
Grab the xml
Write stuff to file
Repeat
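To be explicit, the per-hit filter I thought I was applying is the same if-test as in the script above, pulled out here as a little helper (made-up name, just to illustrate):

def is_perfect_hit(hsp):
    #Keep only gapless hits that are 100% identical over the aligned region
    percent_id = (100*hsp.identities)/hsp.align_length
    return hsp.gaps == 0 and percent_id == 100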
Is there a difference in the xml files from NCBI vs a local blast install
in terms of how biopython sees them?
Can anyone give me some pointers for how to solve this (did I goof up the
loop or how it iterates over the sequences)?
Is this the best way to go about solving this problem (local vs NCBI web)?
Thank you!
ara
--
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
sub cardine glacialis ursae.
Geoscience website: http://www.tattooedscience.org/