[Biopython] Using Biopython to retrieve details on an unknown sequence by BLAST

Iddo Friedberg idoerg at gmail.com
Mon Aug 18 15:29:58 UTC 2014


Lior,

As you have already identified, doing everything locally is probably best
if you have a large number of queries.

You can also keep BLASTing remotely, collecting all your hits, and then do
the following:

ncbiResults = Entrez.efetch(db="nucleotide", id=[gi1, gi2,gi3...],
rettype="gb", retmode="text")

Where you provide a list of all gi keys you collected. You will get a
list of Entrez records.


You can then process these records with SeqIO.parse() (as an iterator)
instead of SeqIO.read(), e.g.:

organisms = []

for seqrec in SeqIO.parse(ncbiResults):

     organisms.append(seqrec.annotations['organism'])


Best,

Iddo

On Mon, Aug 18, 2014 at 10:14 AM, Lior Glick <liorglic at mail.tau.ac.il>
wrote:

> Dear Biopython list users,
>
> I'm using Biopython for the first time. I have sequence data from unknown
> organisms, and trying to use BLAST to tell which organism they are most
> likely to have come from. I wrote the following function to do that:
>
> def find_organism(file):"""
> Receives a fasta file with a single seq, and uses BLAST to find
> from which organism it was taken.
> """# get seq from fasta file
> seqRecord = SeqIO.read(file,"fasta")# run BLAST
> blastResult = NCBIWWW.qblast("blastn", "nt", seqRecord.seq)# get first hit
> blastRecord = NCBIXML.read(blastResult)
> firstHit = blastRecord.alignments[0]# get hit's gi number
> title = firstHit.title
> gi = title.split("|")[1]# search NCBI for the gi number
> ncbiResult = Entrez.efetch(db="nucleotide", id=gi, rettype="gb", retmode="text")
> ncbiResultSeqRec = SeqIO.read(ncbiResult,"gb")# get organism
> annotatDict = ncbiResultSeqRec.annotationsreturn(annotatDict['organism'])
>
>  It works fine, but takes about 2 minutes to retrieve the organism for
> each species, which seems very slow to me. I'm just wondering if I could do
> better. I know that I may create a local copy of NCBI to improve
> performance, and I might do that. However, I suspect that querying BLAST
> first, then take the id and use it to query Entrez is not the way to go. Do
> you have any other suggestions for improvements?
> Thanks!
>
> _______________________________________________
> Biopython mailing list  -  Biopython at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython
>



-- 
Iddo Friedberg
http://iddo-friedberg.net/contact.html
++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
.>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
>>----.<--.>++++++.<<<<------------------------------------.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20140818/3abd3e31/attachment-0001.html>


More information about the Biopython mailing list