[Biopython] Using Biopython to retrieve details on an unknown sequence by BLAST

Lior Glick liorglic at mail.tau.ac.il
Mon Aug 18 14:14:49 UTC 2014


Dear Biopython list users,

I'm using Biopython for the first time. I have sequence data from unknown
organisms, and I am trying to use BLAST to tell which organism each sequence
is most likely to have come from. I wrote the following function to do that:

from Bio import Entrez, SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

def find_organism(file):
    """
    Receives a fasta file with a single seq, and uses BLAST to find
    from which organism it was taken.
    """
    # get seq from fasta file
    seqRecord = SeqIO.read(file, "fasta")
    # run BLAST
    blastResult = NCBIWWW.qblast("blastn", "nt", seqRecord.seq)
    # get first hit
    blastRecord = NCBIXML.read(blastResult)
    firstHit = blastRecord.alignments[0]
    # get hit's gi number
    title = firstHit.title
    gi = title.split("|")[1]
    # search NCBI for the gi number
    ncbiResult = Entrez.efetch(db="nucleotide", id=gi, rettype="gb", retmode="text")
    ncbiResultSeqRec = SeqIO.read(ncbiResult, "gb")
    # get organism
    annotatDict = ncbiResultSeqRec.annotations
    return annotatDict['organism']
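
The GI lookup in the function above relies on the hit title starting with
"gi|<number>|...", and title.split("|")[1] would silently return something
else if a title had a different layout. A minimal sketch of a more defensive
version of that one step is below. The helper name gi_from_title and the
example title are hypothetical, made up for illustration, and the sketch
assumes the pipe-separated layout that the original code already expects.

```python
def gi_from_title(title):
    """Return the GI number from a BLAST hit title, or None if it is absent."""
    fields = title.split("|")
    # Assumed layout: "gi|<number>|<db>|<accession>|<description>".
    if len(fields) >= 2 and fields[0].strip() == "gi" and fields[1].isdigit():
        return fields[1]
    # Title does not start with a numeric GI field.
    return None


# Made-up title for illustration only; not a real database record.
example_title = "gi|123456|gb|XX000000.1| Example organism hypothetical gene"
print(gi_from_title(example_title))
print(gi_from_title("no pipes here"))
```

Returning None instead of raising lets the caller decide whether to skip the
hit or try the next alignment in blastRecord.alignments.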

It works fine, but takes about 2 minutes to retrieve the organism for each
sequence, which seems very slow to me. I'm just wondering if I could do
better. I know that I could create a local copy of the NCBI database to
improve performance, and I might do that. However, I suspect that querying
BLAST first, then taking the ID and using it to query Entrez, is not the way
to go. Do you have any other suggestions for improvements?
Thanks!