[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Fri Sep 14 22:43:12 EDT 2012

Last weekend I also talked with Peter during his visit to Tokyo about the Blast (human-readable) plain-text parser. We could see three scenarios in which the plain-text parser has an advantage over the XML parser (Peter, please correct me if I am missing something from our discussion):

1) The file size of Blast plain-text output may be smaller than that of Blast XML output;
2) Users may want to look at the Blast output by eye in addition to parsing it with Biopython;
3) Users may have stacks of old Blast output files in plain-text format that they still want to use.

Each of these points can be addressed without a Blast plain-text parser:
1) After zipping, we expect little difference in file size between plain-text output and XML output;
2) If we add a function to Biopython that generates Blast plain-text output (or something close to it) from Blast XML output, then a user can generate the Blast output in XML format, parse it with Biopython, optionally filter it, and then generate the corresponding plain-text output;
3) If this is really an issue, then we could create some standalone scripts (available from the Biopython website) that parse plain-text Blast output and generate the corresponding XML output. These scripts would be much simpler than the current plain-text parser in Biopython, because we can write a separate script for each version of Blast (and of course only if the need actually arises). The XML output can then be parsed by Biopython.
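
To make (2) a bit more concrete, here is a rough sketch of what such a generator might look like. It is only an illustration under several assumptions: it reads the XML with the standard library's xml.etree.ElementTree instead of Biopython's own parser so that it runs on its own, the function name blast_xml_to_text and its max_evalue parameter are made up for this example, and the text it writes only loosely resembles real Blast plain-text output.

```python
import io
import xml.etree.ElementTree as ET


def blast_xml_to_text(handle, max_evalue=10.0):
    """Return a rough plain-text rendering of the HSPs in a BLAST XML document.

    Hypothetical helper, not part of Biopython. HSPs with an e-value above
    max_evalue are filtered out before the text is generated.
    """
    lines = []
    # iterparse streams the document; we only act once an <Iteration> is complete.
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag != "Iteration":
            continue
        lines.append("Query= " + elem.findtext("Iteration_query-def", default="?"))
        for hit in elem.iter("Hit"):
            kept = [hsp for hsp in hit.iter("Hsp")
                    if float(hsp.findtext("Hsp_evalue")) <= max_evalue]
            if not kept:
                continue  # every HSP of this hit was filtered out
            lines.append(">" + hit.findtext("Hit_def", default="?"))
            for hsp in kept:
                lines.append(" Score = %s bits, Expect = %g" % (
                    hsp.findtext("Hsp_bit-score"),
                    float(hsp.findtext("Hsp_evalue"))))
                lines.append("Query  " + hsp.findtext("Hsp_qseq"))
                lines.append("       " + hsp.findtext("Hsp_midline"))
                lines.append("Sbjct  " + hsp.findtext("Hsp_hseq"))
        lines.append("")
        elem.clear()  # free the finished iteration before reading the next one
    return "\n".join(lines)


# Tiny hand-written sample, trimmed to the elements the sketch reads.
SAMPLE = b"""<?xml version="1.0"?>
<BlastOutput><BlastOutput_iterations><Iteration>
  <Iteration_query-def>query1</Iteration_query-def>
  <Iteration_hits>
    <Hit><Hit_def>subject A</Hit_def><Hit_hsps><Hsp>
      <Hsp_bit-score>50.0</Hsp_bit-score><Hsp_evalue>1e-20</Hsp_evalue>
      <Hsp_qseq>ACGT</Hsp_qseq><Hsp_hseq>ACGT</Hsp_hseq>
      <Hsp_midline>||||</Hsp_midline>
    </Hsp></Hit_hsps></Hit>
    <Hit><Hit_def>subject B</Hit_def><Hit_hsps><Hsp>
      <Hsp_bit-score>12.0</Hsp_bit-score><Hsp_evalue>5.0</Hsp_evalue>
      <Hsp_qseq>ACGT</Hsp_qseq><Hsp_hseq>ACCT</Hsp_hseq>
      <Hsp_midline>|| |</Hsp_midline>
    </Hsp></Hit_hsps></Hit>
  </Iteration_hits>
</Iteration></BlastOutput_iterations></BlastOutput>
"""

if __name__ == "__main__":
    print(blast_xml_to_text(io.BytesIO(SAMPLE), max_evalue=1e-5))
```

A real implementation would presumably start from the record objects that Biopython's XML parser already returns, and would need to add coordinates, line wrapping, and the header and statistics sections to come close to the layout Blast itself produces.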

Are there any other cases in which the plain-text parser is needed?
Or where our proposed solutions to the three points above are not sufficient?
If not, then I suggest we implement the plain-text generator in (2), and upgrade the PendingDeprecationWarning in Bio.Blast.NCBIStandalone to a BiopythonDeprecationWarning.
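
For reference, the practical effect of upgrading the warning can be sketched as follows. Python hides PendingDeprecationWarning under its default warning filters, so most users never see it, whereas a warning class that derives directly from Warning is displayed. The class below is a stand-in that only borrows the name of Biopython's warning so the sketch runs without Biopython installed, and the helper function and message text are hypothetical.

```python
import warnings


class BiopythonDeprecationWarning(Warning):
    """Stand-in for Biopython's warning class (assumed to derive from Warning)."""


def warn_plaintext_parser_deprecated():
    # Hypothetical helper; a real module would more likely warn at import time.
    warnings.warn(
        "Bio.Blast.NCBIStandalone is deprecated; please use Blast XML output "
        "and the XML parser instead.",
        BiopythonDeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )


if __name__ == "__main__":
    warn_plaintext_parser_deprecated()
```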

Best,
-Michiel

--- On Thu, 9/13/12, Fields, Christopher J <cjfields at illinois.edu> wrote:

> From: Fields, Christopher J <cjfields at illinois.edu>
> Subject: Re: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
> To: "Michiel de Hoon" <mjldehoon at yahoo.com>
> Cc: "BioPython Mailing List" <biopython at lists.open-bio.org>, "Martin Mokrejs" <mmokrejs at fold.natur.cuni.cz>
> Date: Thursday, September 13, 2012, 9:32 PM
> On Sep 13, 2012, at 7:37 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> 
> > --- On Thu, 9/13/12, Martin Mokrejs <mmokrejs at fold.natur.cuni.cz> wrote:
> >> P.S.: And yes, I would love to parse blastn plaintext output
> >> or some other more compact one, the XML is really an overkill.
> > 
> > What exactly is the advantage of plain text parsing compared to XML? File size?
> > 
> > Best,
> > -Michiel.
> 
> There isn't any.  In fact, NCBI has consistently stated
> that one should never rely on parsing BLAST text output,
> primarily b/c they reserve the right to make changes to the
> output at any given point, whereas XML output should remain
> stable.  As someone who has taken care of legacy BLAST
> code for a number of years (BioPerl), I can state that is
> fairly close to the truth (the caveat being they have made
> changes that break some XML parsing, but they do try to fix
> them).  BLAST XML has simply been much easier to deal
> with in terms of fixing issues than text.
> 
> chris