[BioPython] Blast XML parser

Christof Winter winter at biotec.tu-dresden.de
Tue Dec 12 16:42:39 UTC 2006


Dear Michiel:

I just tested the patched NCBIXML.py with the XML format output of multiple sequences 
blasted online at the NCBI BLAST website.

The result looks fine; however, it seems that query IDs and definitions are not parsed. 
This is probably because of NCBI's change of tag names from

<BlastOutput_query-ID> and <BlastOutput_query-def>
in the concatenated, old XML format to

<Iteration_query-ID> and <Iteration_query-def>
in the new, valid XML format.
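
To illustrate the tag change, here is a minimal sketch that reads the query ID and definition from either layout. It uses only the standard library's xml.etree.ElementTree, not NCBIXML, and it is not the patch itself. The helper name query_info and the two sample documents are made up for this example and heavily trimmed compared to real BLAST output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample documents (not real BLAST output).
OLD_XML = """<BlastOutput>
  <BlastOutput_query-ID>lcl|1_0</BlastOutput_query-ID>
  <BlastOutput_query-def>seq1</BlastOutput_query-def>
  <BlastOutput_iterations>
    <Iteration><Iteration_iter-num>1</Iteration_iter-num></Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

NEW_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-ID>lcl|1_0</Iteration_query-ID>
      <Iteration_query-def>seq1</Iteration_query-def>
    </Iteration>
    <Iteration>
      <Iteration_query-ID>lcl|2_0</Iteration_query-ID>
      <Iteration_query-def>seq2</Iteration_query-def>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def query_info(root):
    """Yield (query_id, query_def) for each <Iteration> under root.

    Prefers the per-iteration tags and falls back to the
    document-level tags when the per-iteration ones are absent.
    """
    for iteration in root.iter("Iteration"):
        query_id = iteration.findtext("Iteration_query-ID")
        if query_id is None:
            query_id = root.findtext("BlastOutput_query-ID")
        query_def = iteration.findtext("Iteration_query-def")
        if query_def is None:
            query_def = root.findtext("BlastOutput_query-def")
        yield query_id, query_def


for xml_text in (OLD_XML, NEW_XML):
    for query_id, query_def in query_info(ET.fromstring(xml_text)):
        print(query_id, query_def)
```

A parser that checks the Iteration_* tags first and the BlastOutput_* tags second would presumably fill in the query fields for both formats.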

Concerning the new syntax, I would prefer a unified syntax for parsing both XML 
formats, and I would like to vote for Peter's "nice idea" in his comment #6 on bug 1970 
(http://bugzilla.open-bio.org/show_bug.cgi?id=1970). Running the same code on different 
machines with different local BLAST versions constantly gives me a headache when parsing 
the results. As long as these different BLAST versions are out there, people will run into 
problems and fill the BioPython discussion lists with questions.
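
As a rough illustration of what a unified reading of both layouts could look like, the sketch below iterates over <Iteration> elements whether the input is several concatenated XML documents or one valid document. It is a standard-library sketch only: it is not the NCBIXML patch, and it is not necessarily what comment #6 proposes. The function names and the sample strings are hypothetical, and it assumes every concatenated document starts with an "<?xml" declaration and that the whole output fits in memory.

```python
import xml.etree.ElementTree as ET

XML_DECL = "<?xml"


def iter_blast_roots(text):
    """Yield one parsed root element per XML document found in text."""
    if XML_DECL not in text:
        # No declaration at all: treat the input as a single document.
        yield ET.fromstring(text)
        return
    # Assumption: each concatenated document begins with an XML declaration.
    for chunk in text.split(XML_DECL):
        if chunk.strip():
            yield ET.fromstring(XML_DECL + chunk)


def iter_iterations(text):
    """Yield every <Iteration> element, regardless of the file layout."""
    for root in iter_blast_roots(text):
        for iteration in root.iter("Iteration"):
            yield iteration


# Hypothetical, heavily trimmed sample data (not real BLAST output).
ONE_ITER = "<Iteration><Iteration_iter-num>1</Iteration_iter-num></Iteration>"
OLD_DOC = ('<?xml version="1.0"?>\n<BlastOutput><BlastOutput_iterations>'
           + ONE_ITER + "</BlastOutput_iterations></BlastOutput>\n")
CONCATENATED = OLD_DOC + OLD_DOC  # old layout: two documents in one file
SINGLE = ('<?xml version="1.0"?>\n<BlastOutput><BlastOutput_iterations>'
          + ONE_ITER + ONE_ITER
          + "</BlastOutput_iterations></BlastOutput>\n")  # new layout

print(len(list(iter_iterations(CONCATENATED))))
print(len(list(iter_iterations(SINGLE))))
```

The calling code is the same for both layouts, which is the property I am after; a real implementation would want to stream the file instead of reading it whole.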

Cheers,
Christof


Michiel Jan Laurens de Hoon wrote:
> The file format of Blast XML output changed with recent (>= 2.2.14 I 
> believe) versions of blast if multiple sequences are blasted at the same 
> time. Older versions of blast return an output file consisting of 
> several XML files concatenated together. Newer blast versions return one 
> XML file containing the blast results for all blasted sequences. While 
> this has the advantage of being a valid XML file, it breaks 
> NCBIStandalone.Iterator, which looks for the start of a new XML file 
> when iterating over the blast results.
> 
> There are several bug reports now related to parsing multiple blast 
> records (bugs 1970, 2051, 2090).
> 
> I have written a patch to Bio/Blast/NCBIXML to fix this problem. As it 
> changes the way NCBIXML is used, I was wondering if anybody has 
> objections to this approach.
> 
> 
> Current usage of NCBIXML (single Blast record):
> 
> from Bio.Blast import NCBIXML
> blast_out = open("myblastoutput.xml")
> parser = NCBIXML.BlastParser()
> b_record = parser.parse(blast_out)
> 
> 
> New usage of NCBIXML (single Blast record):
> 
> from Bio.Blast import NCBIXML
> blast_out = open("myblastoutput.xml")
> b_records = NCBIXML.parse(blast_out)
> b_record = next(b_records)
> 
> 
> 
> Current usage of NCBIXML (multiple Blast records):
> 
> from Bio.Blast import NCBIStandalone, NCBIXML
> parser = NCBIXML.BlastParser()
> blast_out = open("myblastoutput.xml")
> for b_record in NCBIStandalone.Iterator(blast_out, parser):
>      pass  # Do something with the record
> 
> 
> New usage of NCBIXML (multiple Blast records):
> 
> from Bio.Blast import NCBIXML
> blast_out = open("myblastoutput.xml")
> b_records = NCBIXML.parse(blast_out)
> for b_record in b_records:
>      pass  # Do something with the record
> 
> 
> Objections, anybody?
> In case you want to try this, you can download the patch from Bugzilla 
> bug #1970.
> 
> --Michiel.
> 
> 

-- 
Christof Winter
Bioinformatics Group
TU Dresden
Tatzberg 47-51
01307 Dresden, Germany

Phone: +49 351 463 40065
EMail: winter at biotec.tu-dresden.de


