[Biopython] Using Biopython to retrieve details on an unknown sequence by BLAST
Iddo Friedberg
idoerg at gmail.com
Mon Aug 18 15:29:58 UTC 2014
Lior,
As you have already identified, doing everything locally is probably best
if you have a large number of queries.
You can also keep BLASTing remotely, collecting all your hits, and then do
the following:
ncbiResults = Entrez.efetch(db="nucleotide", id=[gi1, gi2,gi3...],
rettype="gb", retmode="text")
Where you provide a list of all gi keys you collected. You will get a
list of Entrez records.
You can then process these records with SeqIO.parse() (as an iterator)
instead of SeqIO.read(), e.g.:
organisms = []
for seqrec in SeqIO.parse(ncbiResults):
organisms.append(seqrec.annotations['organism'])
Best,
Iddo
On Mon, Aug 18, 2014 at 10:14 AM, Lior Glick <liorglic at mail.tau.ac.il>
wrote:
> Dear Biopython list users,
>
> I'm using Biopython for the first time. I have sequence data from unknown
> organisms, and trying to use BLAST to tell which organism they are most
> likely to have come from. I wrote the following function to do that:
>
> def find_organism(file):"""
> Receives a fasta file with a single seq, and uses BLAST to find
> from which organism it was taken.
> """# get seq from fasta file
> seqRecord = SeqIO.read(file,"fasta")# run BLAST
> blastResult = NCBIWWW.qblast("blastn", "nt", seqRecord.seq)# get first hit
> blastRecord = NCBIXML.read(blastResult)
> firstHit = blastRecord.alignments[0]# get hit's gi number
> title = firstHit.title
> gi = title.split("|")[1]# search NCBI for the gi number
> ncbiResult = Entrez.efetch(db="nucleotide", id=gi, rettype="gb", retmode="text")
> ncbiResultSeqRec = SeqIO.read(ncbiResult,"gb")# get organism
> annotatDict = ncbiResultSeqRec.annotationsreturn(annotatDict['organism'])
>
> It works fine, but takes about 2 minutes to retrieve the organism for
> each species, which seems very slow to me. I'm just wondering if I could do
> better. I know that I may create a local copy of NCBI to improve
> performance, and I might do that. However, I suspect that querying BLAST
> first, then take the id and use it to query Entrez is not the way to go. Do
> you have any other suggestions for improvements?
> Thanks!
>
> _______________________________________________
> Biopython mailing list - Biopython at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython
>
--
Iddo Friedberg
http://iddo-friedberg.net/contact.html
++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
.>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
>>----.<--.>++++++.<<<<------------------------------------.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20140818/3abd3e31/attachment-0001.html>
More information about the Biopython
mailing list