[BioPython] Blast XML parser
Michiel Jan Laurens de Hoon
mdehoon at c2b2.columbia.edu
Mon Dec 11 17:11:45 UTC 2006
The format of Blast XML output changed in recent versions of blast
(>= 2.2.14, I believe) for the case where multiple sequences are blasted
at the same time. Older versions of blast return an output file
consisting of several XML documents concatenated together. Newer
versions return a single XML file containing the blast results for all
blasted sequences. While this has the advantage of being a valid XML
file, it breaks NCBIStandalone.Iterator, which looks for the start of a
new XML file when iterating over the blast results.
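To make the difference concrete, here is a minimal sketch of what goes
wrong. It is not the actual NCBIStandalone.Iterator code. The two sample
strings are hypothetical, heavily trimmed stand-ins for real blast
output, and split_on_xml_declaration is a made-up helper that mimics an
iterator which starts a new record at each XML declaration.

```python
import re

# Hypothetical, heavily trimmed stand-ins for real blast XML output.
OLD_STYLE = (
    '<?xml version="1.0"?>\n'
    '<BlastOutput><Iteration>query1</Iteration></BlastOutput>\n'
    '<?xml version="1.0"?>\n'
    '<BlastOutput><Iteration>query2</Iteration></BlastOutput>\n'
)
NEW_STYLE = (
    '<?xml version="1.0"?>\n'
    '<BlastOutput>'
    '<Iteration>query1</Iteration><Iteration>query2</Iteration>'
    '</BlastOutput>\n'
)


def split_on_xml_declaration(text):
    """Split text into chunks, each starting at an XML declaration."""
    starts = [m.start() for m in re.finditer(r"<\?xml", text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]


old_chunks = split_on_xml_declaration(OLD_STYLE)
new_chunks = split_on_xml_declaration(NEW_STYLE)

# Old style: one chunk per query. New style: one chunk holding every query.
print(len(old_chunks), len(new_chunks))
```

With the old format the split yields one chunk per blasted sequence.
With the new format it yields a single chunk, so anything that relies on
the XML declaration as a record boundary only ever sees one record.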
There are several bug reports now related to parsing multiple blast
records (bugs 1970, 2051, 2090).
I have written a patch to Bio/Blast/NCBIXML to fix this problem. As it
changes the way NCBIXML is used, I was wondering if anybody has
objections to this approach.
Current usage of NCBIXML (single Blast record):
from Bio.Blast import NCBIXML
blast_out = open("myblastoutput.xml")
parser = NCBIXML.BlastParser()
b_record = parser.parse(blast_out)
New usage of NCBIXML (single Blast record):
from Bio.Blast import NCBIXML
blast_out = open("myblastoutput.xml")
b_records = NCBIXML.parse(blast_out)
b_record = b_records.next()
Current usage of NCBIXML (multiple Blast records):
from Bio.Blast import NCBIStandalone, NCBIXML
parser = NCBIXML.BlastParser()
blast_out = open("myblastoutput.xml")
for b_record in NCBIStandalone.Iterator(blast_out, parser):
    # Do something with the record
    pass
New usage of NCBIXML (multiple Blast records):
from Bio.Blast import NCBIXML
blast_out = open("myblastoutput.xml")
b_records = NCBIXML.parse(blast_out)
for b_record in b_records:
    # Do something with the record
    pass
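For anyone curious how a parse() function can hand back records one at a
time from a single XML file, here is a rough sketch of the general idea.
It is not the patch itself: it uses xml.etree.ElementTree.iterparse from
a current Python standard library, the name iter_blast_records is made
up, and each record is reduced to a plain dict rather than a Blast
record object. The element names are taken from the blast XML format;
the sample data is invented.

```python
import io
import xml.etree.ElementTree as ET


def iter_blast_records(handle):
    """Yield one dict per <Iteration> element (hypothetical helper)."""
    for event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration":
            yield {
                "query": elem.findtext("Iteration_query-def"),
                "hits": [hit.findtext("Hit_def") for hit in elem.iter("Hit")],
            }
            # Release the children of this iteration once it has been used.
            elem.clear()


# Invented sample in the single-file format: two queries, one with a hit.
SAMPLE = b"""<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query1</Iteration_query-def>
      <Iteration_hits>
        <Hit><Hit_def>subject A</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>query2</Iteration_query-def>
      <Iteration_hits/>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""

records = list(iter_blast_records(io.BytesIO(SAMPLE)))
for record in records:
    print(record["query"], record["hits"])
```

Because the function is a generator, the same call covers both the
single-record and the multiple-record case, which is the point of the
proposed interface.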
Objections, anybody?
In case you want to try this, you can download the patch from Bugzilla
bug #1970.
--Michiel.
--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032