[BioPython] plain txt blast output - xml instead
Peter
biopython at maubp.freeserve.co.uk
Tue Jun 20 13:52:48 UTC 2006
Peter wrote:
>>> According to the XML file, it is from BLASTP 2.2.14 [May-07-2006],
>>> maybe they changed the XML format without telling anyone?
Michiel wrote:
>>It appears that the XML format did change.
>>With Blastp 2.2.14, multiple searches generate multiple
>><Iteration>...</Iteration> blocks, one for each search.
>>With an older Blastp, multiple searches effectively generate multiple
>>XML files (each with one <Iteration>...</Iteration> block). These files
>>are then concatenated into one output file. Biopython then parses this
>>file by looking for the beginning of each XML file in this output file.
>>
>>The new output is in a sense better because the output file is a valid
>>XML file. It may be that Biopython's XML parser ignores the <Iteration>
>>tags, since in the old format there was only one <Iteration> block
>>anyway, and therefore fails with the new format.
Rohini Damle wrote:
> So what do one need to do to make biopython working? Make changes in
> the XML parser so that it will consider one iteration for one result
> output?
Basically, yes, we need to change the Biopython NCBI Blast XML code
somehow. This might be best moved to the development mailing list.
Some relevant but probably slightly out-of-date documentation:
ftp://ftp.ncbi.nlm.nih.gov/blast/documents/xml/README.blxml
Notice this appears to describe the <Iteration>...</Iteration> block as
follows:
BlastOutput_iter-num: the psi-blast iteration number (optional)
So whatever we do, we should have a look at the psi-blast output as well...
One idea I was thinking about is to modify the existing Blast XML parser
so that the caller specifies WHICH iteration number it should parse
(ignoring the rest). An invalid iteration number would raise a new
exception. Then, a new Blast XML iterator would call the parser
repeatedly, incrementing the iteration number until the "invalid
iteration number" exception was raised, which would signal the end.
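To make the idea concrete, here is a rough sketch only, not the existing
Biopython code. It uses the standard library's ElementTree rather than our
parser, and it assumes that "iteration number" simply means the position of
the <Iteration> block in the file, counting from 1. The names
InvalidIterationError, parse_iteration and iterate_iterations, and the
trimmed-down SAMPLE document, are all made up for illustration.

```python
import xml.etree.ElementTree as ET


class InvalidIterationError(ValueError):
    """Hypothetical exception: the requested iteration does not exist."""


# Made-up, heavily trimmed stand-in for new-style BLAST XML output.
SAMPLE = """<BlastOutput><BlastOutput_iterations>
<Iteration><Iteration_hits><Hit/><Hit/></Iteration_hits></Iteration>
<Iteration><Iteration_hits><Hit/></Iteration_hits></Iteration>
</BlastOutput_iterations></BlastOutput>"""


def parse_iteration(xml_text, iter_num):
    """Return the iter_num-th <Iteration> element (1-based), ignoring the rest.

    Assumption: iterations are selected by position in the document.
    """
    iterations = list(ET.fromstring(xml_text).iter("Iteration"))
    if not 1 <= iter_num <= len(iterations):
        raise InvalidIterationError("no iteration number %d" % iter_num)
    return iterations[iter_num - 1]


def iterate_iterations(xml_text):
    """Call the parser repeatedly until the iteration number is invalid."""
    iter_num = 1
    while True:
        try:
            iteration = parse_iteration(xml_text, iter_num)
        except InvalidIterationError:
            return  # the exception signals the end of the results
        yield iteration
        iter_num += 1


if __name__ == "__main__":
    for number, iteration in enumerate(iterate_iterations(SAMPLE), start=1):
        print(number, len(list(iteration.iter("Hit"))))
```

Note that this re-reads the whole document once per iteration, so the cost
grows quickly with the number of searches in the file.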
Note that with the "old style concatenated XML entries" we could parse
each entry one by one, without having to load the entire XML file into
memory at once. I don't think that will be possible with the new style
XML files.
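That said, an event-driven parser might get around the memory problem. The
following is an untested-against-real-data sketch, assuming the standard
library's xml.etree.ElementTree.iterparse is acceptable; it is not how the
current Biopython parser works. The function name stream_iterations and the
SAMPLE document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed stand-in for new-style BLAST XML output.
SAMPLE = """<BlastOutput><BlastOutput_iterations>
<Iteration><Iteration_hits><Hit/><Hit/></Iteration_hits></Iteration>
<Iteration><Iteration_hits><Hit/></Iteration_hits></Iteration>
</BlastOutput_iterations></BlastOutput>"""


def stream_iterations(handle):
    """Yield each <Iteration> element as soon as its closing tag is read.

    The element is cleared once the caller asks for the next one, so the
    caller must extract what it needs before advancing the generator.
    """
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration":
            yield elem
            elem.clear()  # drop the children we have finished with


if __name__ == "__main__":
    handle = io.BytesIO(SAMPLE.encode("ascii"))
    for iteration in stream_iterations(handle):
        print(len(list(iteration.iter("Hit"))))
```

The emptied <Iteration> elements still stay attached to the tree, so memory
use is reduced rather than constant.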
Peter