[Biopython] getting multiple BLAST (NCBIWWW) queries to work

Tue Mar 29 16:43:35 EDT 2011

OK, when I create a .fasta file containing just the first sequence in
opuntia, I get no hits. However, when I copy and paste the nucleotide
sequence directly into the script, I get 50 hits!  This is consistent
with what happens when I paste the first opuntia sequence into the NCBI
BLAST web interface, though there I obtain 110 hits for intronic
sequences in Opuntia chloroplast and chloroplasts. As a secondary
point, I also find it curious that the result from NCBIWWW is limited
to 50 hits (I thought the default was 500). But what is more
problematic is that I get no hits when using a FASTA file containing
only a single sequence, when clearly there are some very high
homology hits present in nr.

This is my code from beginning to end, where opuntia1.fasta is a file
containing only the 1st sequence from opuntia.fasta. Using the
(commented-out) line for opuntia1.fasta resulted in no hits. I am
using Biopython 1.5.3 and Python 2.6 on Ubuntu, in case this has any
effect on the results. I also tried obtaining a single record from
SeqIO.parse and passing the Seq of that record, and that also
gave 50 hits. So it's basically only when using a FASTA file handle
that I can't get it to work.

#!/usr/bin/python
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
result_handle = NCBIWWW.qblast("blastn", "nr",
"TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAATCTAAATGATATAGGATTCCACTATGTAAGGTCTTTGAATCATATCATAAAAGACAATGTAATAAAGCATGAATACAGATTCACACATAATTATCTGATATGAATCTATTCATAGAAAAAAGAAAAAAGTAAGAGCCTCCGGCCAATAAAGACTAAGAGGGTTGGCTCAAGAACAAAGTTCATTAAGAGCTCCATTGTAGAATTCAGACCTAATCATTAATCAAGAAGCGATGGGAACGATGTAATCCATGAATACAGAAGATTCAATTGAAAAAGATCCTATGNTCATTGGAAGGATGGCGGAACGAACCAGAGACCAATTCATCTATTCTGAAAAGTGATAAACTAATCCTATAAAACTAAAATAGATATTGAAAGAGTAAATATTCGCCCGCGAAAATTCCTTTTTTATTAAATTGCTCATATTTTCTTTTAGCAATGCAATCTAATAAAATATATCTATACAAAAAAACATAGACAAACTATATATATATATATATATAATATATTTCAAATTCCCTTATATATCCAAATATAAAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGTATTATTAAATGTATATATTAATTCAATATTATTATTCTATTCATTTTTATTCATTTTCAAATTTATAATATATTAATCTATATATTAATTTAGAATTCTATTCTAATTCGAATTCAATTTTTAAATATTCATATTCAATTAAAATTGAAATTTTTTCATTCGCGAGGAGCCGGATGAGAAGAAACTCTCATGTCCGGTTCTGTAGTAGAGATGGAATTAAGAAAAAACCATCAACTATAACCCCAAAAGAACCAGA")

#result_handle = NCBIWWW.qblast("blastn", "nr", open("opuntia1.fasta", "r"))
blast_record = NCBIXML.read(result_handle)

for description in blast_record.descriptions:
    print(description)

#end of code.
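
A minimal sketch of one thing to try, on the assumption (not verified
against this Biopython version) that qblast expects the query as a
string rather than as an open file handle: read the FASTA text first
and pass that instead. The helper names read_fasta_text and
blast_fasta_file are made up for illustration. hitlist_size is a qblast
keyword that caps the number of hits returned; its default appears to
be 50, which would explain the 50-hit limit.

```python
def read_fasta_text(path):
    """Return the full contents of a FASTA file as a single string."""
    with open(path) as handle:
        return handle.read()


def blast_fasta_file(path, hitlist_size=500):
    """Submit the FASTA text (not the open file handle) and parse the XML.

    Assumes the file holds a single query, since NCBIXML.read expects
    exactly one record in the results.
    """
    # Imported here so read_fasta_text() is usable without Biopython.
    from Bio.Blast import NCBIWWW, NCBIXML

    fasta_text = read_fasta_text(path)
    result_handle = NCBIWWW.qblast("blastn", "nr", fasta_text,
                                   hitlist_size=hitlist_size)
    return NCBIXML.read(result_handle)
```

Calling blast_fasta_file("opuntia1.fasta") and looping over the
descriptions of the returned record, as in the script above, would show
whether the hits come back this way.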

On Tue, Mar 29, 2011 at 2:07 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Mar 29, 2011 at 6:55 PM, James Wagner <jamesrwagner at gmail.com> wrote:
>> Hello:
>>
>> I was trying just as a proof of concept to do an NCBI WWW BLAST query
>> with a FASTA file containing more than one sequence (but still a small
>> number of sequences).
>>
>> I tried with the opuntia.fasta file from the website, and set it up as follows:
>>
>> result_handle = NCBIWWW.qblast("blastn", "nr", open("opuntia.fasta","r"))
>> blast_records = NCBIXML.parse(result_handle)
>>
>> then I try:
>>
>> for record in blast_records:
>>      print(record.alignments)
>>
>> and I obtain:
>> []
>>
>>
>> Surely, since there were 7 sequences in this file, I should at the
>> very least get 7 empty lists, assuming of course that none of the
>> sequences gives a hit in nr (which I am sure is not the case either)?
>
> Not necessarily. The NCBI may have fixed this, but for a long time if
> you had, say, 7 queries but only 2 gave hits, standalone BLAST's
> XML output would only contain those 2 hits. There would be nothing
> at all from the 5 hitless queries. This was/is very annoying, but
> right now I'm not sure if they have fixed this or not.
>
> Try getting back the results as plain text and manually inspect them.
> In the plain text output all the queries appear, and there is a clear
> "no hits found" message.
>
>> What is still missing? I realize I could use SeqIO.parse to obtain
>> each sequence from the FASTA file and do a separate qblast, but surely
>> doing this separately for each sequence would create unnecessary
>> network overhead compared to somehow sending off all
>> the queries at once?
>
> Yes, in theory a single large query should have less overhead
> than individual queries. Personally I'd just use standalone BLAST
> and run it locally if I had more than a few queries.
>
> Peter
>
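
Peter's suggestion of inspecting the plain-text output could be
sketched roughly as follows. This assumes qblast accepts
format_type="Text" and that hitless queries are marked with a "No hits
found" message, as described above; the exact wording and spacing of
that message may differ between BLAST versions, so the match is
case-insensitive. Both function names are hypothetical.

```python
def count_no_hit_queries(blast_text):
    """Count the 'No hits found' messages in plain-text BLAST output."""
    return blast_text.lower().count("no hits found")


def fetch_plain_text(fasta_text):
    """Run the same search but ask for plain text instead of XML."""
    # Imported here so the counting helper works without Biopython.
    from Bio.Blast import NCBIWWW

    handle = NCBIWWW.qblast("blastn", "nr", fasta_text, format_type="Text")
    return handle.read()
```

Comparing count_no_hit_queries(fetch_plain_text(...)) with the number
of sequences in the FASTA file would show whether the hitless queries
are simply absent from the XML or were never searched at all.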