[BioPython] plain txt blast output - xml instead

Peter biopython at maubp.freeserve.co.uk
Tue Jun 20 13:52:48 UTC 2006


Peter wrote:
>>> According to the XML file, it is from BLASTP 2.2.14 [May-07-2006], 
 >>> maybe they changed the XML format without telling anyone?

Michiel wrote:
>>It appears that the XML format did change.
>>With Blastp 2.2.14, multiple searches generate multiple
>><Iteration>...</Iteration> blocks, one for each search.
>>With an older Blastp, multiple searches effectively generate multiple
>>XML files (each with one <Iteration>...</Iteration> block). These files
>>are then concatenated into one output file. Biopython then parses this
>>file by looking for the beginning of each XML file in this output file.
>>
>>The new output is in a sense better because the output file is a valid
>>XML file. It may be that Biopython's XML parser ignores the <Iteration>
>>tags, since in the old format there was only one <Iteration> block
>>anyway, and therefore fails with the new format.

Rohini Damle wrote:
 > So what do one need to do to make biopython working?  Make changes in
 > the XML parser so that it will consider one iteration for one result
 > output?

Basically, yes, we need to change the BioPython NCBI Blast XML code 
somehow - this might be best moved to the development mailing list.

Some relevant but probably slightly out-of-date documentation:

ftp://ftp.ncbi.nlm.nih.gov/blast/documents/xml/README.blxml

Notice this appears to describe the <Iteration>...</Iteration> block as 
follows:

BlastOutput_iter-num: the psi-blast iteration number (optional)

So whatever we do, we should have a look at the psi-blast output as well...

One idea I was thinking about is to modify the existing Blast XML parser 
to take the iteration number it should parse (ignoring the rest). An 
invalid iteration number would raise a new exception.

Then, a new Blast XML iterator would call the parser repeatedly 
incrementing the iteration number until the "invalid iteration number" 
error was raised, which would signal the end.
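
As a rough sketch of that idea (not the actual Biopython code), the 
following uses the standard-library ElementTree module rather than the 
existing parser. The names InvalidIterationError, parse_iteration and 
iterate_blast_xml are made up for illustration, and SAMPLE is a 
hand-written, heavily simplified stand-in for a new-style file; real 
output has many more fields.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for new-style BLAST XML output
# (assumption: real files carry many more elements per block).
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_hits>
        <Hit><Hit_def>hit A</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>2</Iteration_iter-num>
      <Iteration_hits>
        <Hit><Hit_def>hit B</Hit_def></Hit>
        <Hit><Hit_def>hit C</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


class InvalidIterationError(ValueError):
    """Hypothetical exception: the requested iteration is not present."""


def parse_iteration(xml_text, iter_num):
    """Return the hit descriptions for one iteration, ignoring the rest."""
    root = ET.fromstring(xml_text)
    for iteration in root.iter("Iteration"):
        if iteration.findtext("Iteration_iter-num") == str(iter_num):
            return [hit.findtext("Hit_def") for hit in iteration.iter("Hit")]
    raise InvalidIterationError(iter_num)


def iterate_blast_xml(xml_text):
    """Call the parser with increasing iteration numbers until it fails."""
    iter_num = 1
    while True:
        try:
            yield parse_iteration(xml_text, iter_num)
        except InvalidIterationError:
            return  # the invalid iteration number signals the end
        iter_num += 1


for hits in iterate_blast_xml(SAMPLE):
    print(hits)
```

Note that this sketch re-reads the whole document for every iteration, 
so it is simple but wasteful for large files.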

Note that with the "old style concatenated XML entries" we could parse 
each entry one by one, without having to load the entire XML file into 
memory at once.  I don't think that will be possible with the new style 
XML files.
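
That said, an event-driven parser might get around this. Purely as an 
untested-on-real-data sketch, assuming the ElementTree module's 
iterparse function is available, something like the following would 
hand back one <Iteration> block at a time and discard its contents 
afterwards. The function name stream_iterations and the simplified 
SAMPLE data are my own inventions for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for new-style BLAST XML output.
SAMPLE = b"""<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_hits>
        <Hit><Hit_def>hit A</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>2</Iteration_iter-num>
      <Iteration_hits>
        <Hit><Hit_def>hit B</Hit_def></Hit>
        <Hit><Hit_def>hit C</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def stream_iterations(handle):
    """Yield (iteration number, hit descriptions) one block at a time."""
    # "end" events fire once an element's closing tag has been read,
    # so each <Iteration> is complete when we see it here.
    for event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration":
            num = int(elem.findtext("Iteration_iter-num"))
            hits = [hit.findtext("Hit_def") for hit in elem.iter("Hit")]
            yield num, hits
            # Drop the children of the finished block to free memory.
            elem.clear()


for num, hits in stream_iterations(io.BytesIO(SAMPLE)):
    print(num, hits)
```

The emptied <Iteration> elements stay attached to the root, so memory 
use is reduced rather than constant; whether that is good enough for 
very large result files would need checking.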

Peter



