[Biopython] Using Biopython to retrieve details on an unknown sequence by BLAST

Mon Aug 18 15:30:37 UTC 2014

On Mon, Aug 18, 2014 at 3:14 PM, Lior Glick <liorglic at mail.tau.ac.il> wrote:
> Dear Biopython list users,
>
> I'm using Biopython for the first time. I have sequence data from unknown
> organisms, and I am trying to use BLAST to tell which organism they are
> most likely to have come from. I wrote the following function to do that:
>
> def find_organism(file):
>     """
>     Receives a FASTA file with a single sequence, and uses BLAST to find
>     which organism it was taken from.
>     """
>     ...
>
> It works fine, but takes about 2 minutes to retrieve the organism for each
> sequence, which seems very slow to me. I'm just wondering if I could do
> better. I know that I could create a local copy of the NCBI database to
> improve performance, and I might do that. However, I suspect that querying
> BLAST first, then taking the ID and using it to query Entrez, is not the
> way to go. Do you have any other suggestions for improvements?
> Thanks!

I would download the NT database and standalone BLAST+ from the
NCBI and run BLASTN locally, requesting tabular output that includes
the optional taxonomy fields:

ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/
ftp://ftp.ncbi.nlm.nih.gov/blast/db/
http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions.html
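
As a rough illustration of the local approach, here is a minimal Python
sketch. It assumes the blastn binary from BLAST+ is on the PATH, that a
local copy of the NT database is reachable under the name "nt", and that
the NCBI taxonomy files (taxdb) sit alongside it so the name columns are
filled in. The function names and the choice of columns are my own for
illustration, not anything from the original script; subprocess.run with
capture_output needs Python 3.7 or later.

```python
import subprocess

# Columns requested from BLAST+ tabular output (-outfmt 6).
# staxids and sscinames are the optional taxonomy fields.
FIELDS = ["qseqid", "sseqid", "pident", "evalue", "bitscore",
          "staxids", "sscinames"]


def build_blastn_command(query_fasta, db="nt", max_hits=5):
    """Return the blastn argument list (database name is an assumption)."""
    return [
        "blastn",
        "-query", query_fasta,
        "-db", db,
        "-outfmt", "6 " + " ".join(FIELDS),
        "-max_target_seqs", str(max_hits),
    ]


def parse_tabular(text):
    """Turn tab-separated BLAST output into a list of dicts, one per hit."""
    hits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        hits.append(dict(zip(FIELDS, line.split("\t"))))
    return hits


def best_organism(hits):
    """Return the scientific name of the highest bit score hit, or None."""
    if not hits:
        return None
    best = max(hits, key=lambda hit: float(hit["bitscore"]))
    return best["sscinames"]


def find_organism_local(query_fasta, db="nt"):
    """Run blastn locally and report the organism of the best hit."""
    result = subprocess.run(
        build_blastn_command(query_fasta, db),
        capture_output=True, text=True, check=True,
    )
    return best_organism(parse_tabular(result.stdout))
```

Calling find_organism_local("unknown.fasta") would then give the organism
name directly from the BLAST output, with no follow-up Entrez query. The
file name here is only a placeholder.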

Peter