[BioPython] Parsing a remote Blast output

Peter biopython at maubp.freeserve.co.uk
Wed May 21 08:46:13 UTC 2008


On Wed, May 21, 2008 at 12:28 AM, Raul Guerra <colochera at gmail.com> wrote:
> Thank you to everyone who replied to my last post. I am sorry to bother you
> again with a question. Thank you in advance for your time.
>
> I am trying to parse the output from:
>
> result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr,
> entrez_query='"Arabidopsis thaliana" [ORGN]')
>
> where fastaStr is a string in the fasta format.

As you have discovered, this will return XML by default (since
Biopython 1.41).  You will get back a handle object (of some sort).

> I tried to follow the logic of the program and I found that NCBIWWW.qblast()
> is outputting an XML file, and for some reason NCBIWWW.BlastParser() is
> expecting an HTML file. That is my guess of what is going wrong. So what I
> did was to use the parser in NCBIXML.

Well done on working this out.  Can I ask you why you tried the
version using the plain text parser?  I thought we'd updated all our
documentation on this but perhaps we missed something.  See the BLAST
chapter of the tutorial:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

> So I ran the following
>
>    result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr,
> entrez_query='"Arabidopsis thaliana" [ORGN]')
>
>    blast_records = NCBIXML.parse(result_handle)
>
> and it works fine (at least I do not get errors), but I have no idea what
> type of object blast_records is. I tried the following:

Using NCBIXML.parse(result_handle) will return an iterator, but it
doesn't actually start parsing the file until you call the next()
method, which is usually done in a for loop.
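
To illustrate the "nothing is parsed until next() is called" behaviour,
here is a minimal sketch using only the standard library.  It is not
Biopython's implementation: parse_hits, SAMPLE_XML and the <hit> tag are
made-up names for illustration.  In current Python the built-in
next(iterator) is used in place of the iterator.next() method.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for an XML result file.
SAMPLE_XML = b"<results><hit>A</hit><hit>B</hit></results>"

calls = []  # records when parsing work actually begins


def parse_hits(handle):
    """Yield the text of each <hit> element, parsing lazily."""
    calls.append("started")
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "hit":
            yield elem.text


hits = parse_hits(io.BytesIO(SAMPLE_XML))
calls_before = list(calls)  # still empty: no parsing has happened yet

first = next(hits)          # the first next() call triggers parsing
rest = list(hits)           # a for loop or list() keeps calling next()

print(calls_before, first, rest, calls)
```

The same pattern applies to the iterator from NCBIXML.parse(): any
problem with the file only surfaces once the first record is requested.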

> next = blast_records.next()
>
> and got the following error:
>
> Traceback (most recent call last):
> ...
>  File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 214, in
> _end_BlastOutput_version
>    self._header.date = self._value.split()[2][1:-1]
> IndexError: list index out of range
>
> I have not been able to understand what is going on here.

Sadly the NCBI changed their output format slightly, and Biopython
couldn't cope.  We've fixed this now (Bug 2499), but you'll have to
update your installation.  See here for details:
http://bugzilla.open-bio.org/show_bug.cgi?id=2499
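
The failing line in the traceback takes the third whitespace-separated
token of the version string and strips its first and last characters.
The sketch below shows how that raises IndexError once the string has
only two tokens.  Both example strings are assumptions about what the
old and new formats look like, and safe_extract_date is a hypothetical
defensive variant, not the actual Biopython fix.

```python
def extract_date(version_string):
    """Mimic the failing line: third token, with its brackets stripped."""
    return version_string.split()[2][1:-1]


def safe_extract_date(version_string):
    """Hypothetical tolerant variant: return None when no date is present."""
    tokens = version_string.split()
    return tokens[2][1:-1] if len(tokens) > 2 else None


old_style = "blastp 2.2.17 [Aug-26-2007]"  # assumed older format, with a date
new_style = "BLASTP 2.2.18+"               # assumed newer format, no date

failed = False
try:
    extract_date(new_style)
except IndexError as err:
    failed = True
    print("IndexError:", err)

print(extract_date(old_style), safe_extract_date(new_style))
```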

> I just want to parse the results I get from:
>
> result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr,
> entrez_query='"Arabidopsis thaliana" [ORGN]')
>
> Any ideas?

You're very close.  I suggest updating the NCBIXML file to cope with
the current version of BLAST that the NCBI is using online (2.2.18+),
and then using the XML parser:

from Bio.Blast import NCBIWWW, NCBIXML
result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr,
                               entrez_query='"Arabidopsis thaliana" [ORGN]')
for record in NCBIXML.parse(result_handle):
    # Do something with the BLAST result, e.g. inspect record.alignments
    pass
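
As one possible way to fill in the loop body, the sketch below collects
the hits whose E-value falls under a cutoff.  It relies on the
alignments, hsps, title and expect attributes used in the tutorial's
BLAST chapter; significant_hits, the 0.04 cutoff and the stand-in
records are made up here so the example runs without a network
connection.

```python
from types import SimpleNamespace


def significant_hits(records, e_value_thresh=0.04):
    """Collect (title, E-value) pairs for HSPs below the threshold."""
    hits = []
    for record in records:
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < e_value_thresh:
                    hits.append((alignment.title, hsp.expect))
    return hits


# Stand-in objects with invented values, not real BLAST output.
fake_records = [
    SimpleNamespace(alignments=[
        SimpleNamespace(title="hypothetical protein 1",
                        hsps=[SimpleNamespace(expect=1e-30)]),
        SimpleNamespace(title="hypothetical protein 2",
                        hsps=[SimpleNamespace(expect=5.0)]),
    ])
]

print(significant_hits(fake_records))
```

With a real search, pass NCBIXML.parse(result_handle) as the records
argument instead of fake_records.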

Peter


