[BioPython] Blast XML parser

Michiel Jan Laurens de Hoon mdehoon at c2b2.columbia.edu
Mon Dec 11 17:11:45 UTC 2006


The file format of Blast XML output changed with recent (>= 2.2.14 I 
believe) versions of blast if multiple sequences are blasted at the same 
time. Older versions of blast return an output file consisting of 
several XML files concatenated together. Newer blast versions return one 
XML file containing the blast results for all blasted sequences. The 
advantage is that this is a valid XML file, but it breaks 
NCBIStandalone.Iterator, which looks for the start of a new XML file 
when iterating over the blast results.
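
To illustrate why an iterator keyed on the start of a new XML file stops 
working, here is a minimal sketch. It is not the NCBIStandalone.Iterator 
code; the helper name split_concatenated_xml and the sample strings are 
made up for illustration. It only splits the text at each XML declaration.

```python
import re

def split_concatenated_xml(text):
    """Split text at each '<?xml' declaration (hypothetical helper)."""
    # Find the offset of every XML declaration in the text.
    starts = [m.start() for m in re.finditer(r"<\?xml", text)]
    # Each chunk runs from one declaration to the next (or to the end).
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]

# Old-style output: one complete XML document per query, concatenated.
old_style = (
    '<?xml version="1.0"?><BlastOutput><Iteration/></BlastOutput>\n'
    '<?xml version="1.0"?><BlastOutput><Iteration/></BlastOutput>\n'
)

# New-style output: a single XML document with one <Iteration> per query.
new_style = (
    '<?xml version="1.0"?><BlastOutput>'
    '<Iteration/><Iteration/>'
    '</BlastOutput>\n'
)

old_chunks = split_concatenated_xml(old_style)
new_chunks = split_concatenated_xml(new_style)
print(len(old_chunks))  # one chunk per query
print(len(new_chunks))  # a single chunk holding both queries
```

With the new format, splitting on the declaration yields a single chunk, 
so all queries end up being treated as one record.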

There are several bug reports now related to parsing multiple blast 
records (bugs 1970, 2051, 2090).

I have written a patch to Bio/Blast/NCBIXML to fix this problem. As it 
changes the way NCBIXML is used, I was wondering if anybody has 
objections to this approach.


Current usage of NCBIXML (single Blast record):

from Bio.Blast import NCBIXML
blast_out = open("myblastoutput.xml")
parser = NCBIXML.BlastParser()
b_record = parser.parse(blast_out)


New usage of NCBIXML (single Blast record):

from Bio.Blast import NCBIXML
blast_out = open("myblastoutput.xml")
b_records = NCBIXML.parse(blast_out)
b_record = b_records.next()



Current usage of NCBIXML (multiple Blast records):

from Bio.Blast import NCBIStandalone, NCBIXML
parser = NCBIXML.BlastParser()
blast_out = open("myblastoutput.xml")
for b_record in NCBIStandalone.Iterator(blast_out, parser):
     # Do something with the record
     pass


New usage of NCBIXML (multiple Blast records):

from Bio.Blast import NCBIXML
blast_out = open("myblastoutput.xml")
b_records = NCBIXML.parse(blast_out)
for b_record in b_records:
     # Do something with the record
     pass
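
For readers curious how a generator-style parse function can hand back 
one record per query from a single XML file, here is a rough sketch. It 
is not the patch itself (that is attached to the Bugzilla bug), and it 
uses the standard library's xml.etree.ElementTree rather than anything in 
Bio.Blast. The function name iter_blast_records, the dict layout, and the 
trimmed-down sample document are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

def iter_blast_records(handle):
    """Yield one small dict per <Iteration> element (hypothetical helper)."""
    # iterparse reports each element once its closing tag has been read.
    for event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration":
            yield {
                "query": elem.findtext("Iteration_query-def"),
                "num_hits": len(elem.findall("Iteration_hits/Hit")),
            }
            # Release the finished subtree to keep memory use low.
            elem.clear()

# Trimmed-down stand-in for new-style Blast XML output (two queries).
sample = b"""<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>seq1</Iteration_query-def>
      <Iteration_hits><Hit/><Hit/></Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>seq2</Iteration_query-def>
      <Iteration_hits/>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""

records = list(iter_blast_records(io.BytesIO(sample)))
for record in records:
    print(record["query"], record["num_hits"])
```

Because the function is a generator, the caller can loop over the records 
or take just the first one, which matches the two usage patterns shown 
above.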


Objections, anybody?
In case you want to try this, you can download the patch from Bugzilla 
bug #1970.

--Michiel.


-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032


