[Biopython] Finally deprecating the plain text BLAST parser?

Peter Cock p.j.a.cock at googlemail.com
Sat Sep 15 10:49:59 UTC 2012


Hello all,

I've retitled this from Martin's thread initially about the BLAST XML parser:
http://lists.open-bio.org/pipermail/biopython/2012-September/008154.html
...
http://lists.open-bio.org/pipermail/biopython/2012-September/008164.html
http://lists.open-bio.org/pipermail/biopython/2012-September/008165.html

The topic shifted and an important question raised was:

Should we finally deprecate the 'obsolete' plain text BLAST parser?

So - is anyone on the list still using this file format, and why?

[ Speak now or forever hold your peace ;) ]

Thanks,

Peter

On Sat, Sep 15, 2012 at 11:37 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Sat, Sep 15, 2012 at 3:43 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>> Last weekend I also talked with Peter during his visit to Tokyo about the
>> Blast (human-readable) plain-text parser. We could see three scenarios in
>> which the plain-text parser has an advantage over the XML parser (Peter
>> please correct me if I am missing something from our discussion):
>>
>> 1) The file size of Blast plain-text output may be smaller than that of
>> Blast XML output;
>> 2) Users may want to look at the Blast output by eye in addition to
>> parsing it with Biopython;
>> 3) Users may have stacks of old Blast output files in plain-text format
>> that they still want to use.
>
> Maybe also (3a) The user may want plain-text BLAST output to
> input into another tool as well as Biopython?
>
>>
>> Each of these points can be addressed without a Blast plain-text parser:
>> 1) After zipping, we expect little difference in file size between
>> plain-text output and XML output;
>
> However there would be a speed penalty - compression, then
> decompression, and perhaps in XML versus text parsing.
>
>> 2) If we add a function to Biopython that generates Blast plain-text
>> output (or something close to it) from Blast XML output, then a user can
>> generate the Blast output in XML format, parse it with Biopython, optionally
>> filter it, and then generate the corresponding plain-text output;
>
> The new 'SearchIO' results objects str/repr should be familiar to
> anyone who has looked at the plain text BLAST output - but
> not identical. We could apply some of these improvements
> to the current BLAST parsers, but I favour aiming to simply
> deprecate them in favour of 'SearchIO' (namespace to be
> decided).
>
> However, we certainly could try and offer a plain-text BLAST
> output format from 'SearchIO', although IIRC Bow has not tried
> that yet. It shouldn't be too complicated - unless you aim for
> 100% agreement with the latest BLAST output (moving target).
>
>> 3) If this is really an issue, then we could create some standalone
>> scripts (available from the Biopython website) that parse plain-text Blast
>> output and generates the corresponding XML output. These scripts will be
>> much easier than the current plain-text parser in Biopython, because we can
>> create such a script for each version of Blast separately (of course this is
>> only done if the need actually arises). The XML output can then be parsed by
>> Biopython.
>
> I was not convinced that this would actually save any effort over
> continuing to tweak the current (complex but flexible) plain text
> parser.
>
>> Are there any other cases in which the plain-text parser is needed?
>> Or where our proposed solutions to the three points above are not
>> sufficient?
>
> Benchmarking of parsing (a) plain text, (b) XML, (c) gzipped XML,
> and (d) column rich tabular output might be worthwhile. There may
> be a case for parsing plain-text on the basis of speed.
>
>> If not, then I suggest we implement the plain-text generator in (2),
>>
>
> I certainly think adding plain-text output to 'SearchIO' would be
> useful.
>
>> and upgrade the PendingDeprecationWarning in
>> Bio.Blast.NCBIStandalone to a BiopythonDeprecationWarning.
>
> Another idea we touched on was deprecating the current old,
> complex but flexible plain text parser while adding a new simpler
> plain text parser as part of 'SearchIO'. Here we could target only
> the recent BLAST+ output (and perhaps if not so different the
> final 'legacy' BLAST release), and not worry about all the variants
> the NCBI have produced over the years. I would hope this would
> also be faster [especially as currently 'SearchIO' supports parsing
> plain text BLAST on top of the existing old parser].
>
> This boils down to a key question: How many people still want
> to use the plain-text output and why? I believe that for most
> use cases the tabular or XML output is better (covering simple
> needs, and full parsing of every detail respectively).
>
> e.g. It sounds like for Martin's example, the tabular output would
> be a perfect match.
>
> [Although, as I noted above, parsing the XML, especially if
> compressed, may not be as fast as parsing plain text?]
>
> While writing this email I was trying to recall when I last used
> the plain text output - and the only situation I could think of
> in the last year or so was in order to have something human
> readable to show a collaborator. Here XML to plain text BLAST
> would have been fine.
>
> Peter
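
On the benchmarking idea quoted above (plain XML versus gzipped XML), a
rough timing harness might look like the sketch below. It uses only the
Python standard library and a synthetic XML document, so the tag names
and the helpers make_fake_xml, count_hits, parse_plain and parse_gzipped
are illustrative assumptions, not real BLAST output or Biopython API. A
real comparison would swap in actual BLAST files and the Biopython
parsers.

```python
import gzip
import io
import timeit
import xml.etree.ElementTree as ET


def make_fake_xml(n_hits=2000):
    """Build a synthetic XML document (NOT real BLAST output)."""
    rows = "".join(
        "<Hit><Hit_id>seq%d</Hit_id><Hit_len>%d</Hit_len></Hit>" % (i, 100 + i)
        for i in range(n_hits)
    )
    return ("<Hits>" + rows + "</Hits>").encode("ascii")


def count_hits(handle):
    """Stream through the XML and count the <Hit> elements."""
    count = 0
    for _event, elem in ET.iterparse(handle):
        if elem.tag == "Hit":
            count += 1
            elem.clear()  # drop the element's children to limit memory use
    return count


def parse_plain(data):
    """Parse uncompressed XML held in memory."""
    return count_hits(io.BytesIO(data))


def parse_gzipped(compressed):
    """Decompress on the fly, then parse, as with a .xml.gz file."""
    with gzip.open(io.BytesIO(compressed), "rb") as handle:
        return count_hits(handle)


xml_bytes = make_fake_xml()
gz_bytes = gzip.compress(xml_bytes)

t_plain = timeit.timeit(lambda: parse_plain(xml_bytes), number=5)
t_gzip = timeit.timeit(lambda: parse_gzipped(gz_bytes), number=5)
print("sizes: %d bytes raw, %d bytes gzipped" % (len(xml_bytes), len(gz_bytes)))
print("plain XML: %.3fs, gzipped XML: %.3fs" % (t_plain, t_gzip))
```

Working from in-memory bytes keeps disk I/O out of the measurement, so
any difference reflects decompression plus parsing only. Timings on real
files would also include read time, where the smaller gzipped file may
win some of that back.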

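On upgrading the PendingDeprecationWarning: the practical effect is
visibility, since Python's default warning filters ignore
PendingDeprecationWarning, so most users never see it. The sketch below
illustrates this with a hypothetical DemoDeprecationWarning class
standing in for a library-specific warning (my assumption is that
BiopythonDeprecationWarning behaves similarly, but it is not imported
here). The two old_parser_* functions and visible_warnings are likewise
made up for the demonstration.

```python
import warnings


class DemoDeprecationWarning(Warning):
    """Hypothetical stand-in for a library-specific deprecation warning."""


def old_parser_pending():
    """Pretend parser that only issues a pending deprecation warning."""
    warnings.warn("plain text parser may be deprecated",
                  PendingDeprecationWarning)
    return "parsed"


def old_parser_deprecated():
    """Pretend parser after the warning has been upgraded."""
    warnings.warn("plain text parser is deprecated", DemoDeprecationWarning)
    return "parsed"


def visible_warnings(func):
    """Return the warning categories shown under default-like filters."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.resetwarnings()
        warnings.simplefilter("always")
        # Mimic Python's default behaviour of hiding pending deprecations.
        warnings.simplefilter("ignore", PendingDeprecationWarning)
        func()
    return [w.category for w in caught]


print(visible_warnings(old_parser_pending))
print(visible_warnings(old_parser_deprecated))
```

The first call records nothing, the second records the stand-in warning,
which is the change in user experience the upgrade would bring.
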