[Biopython-dev] Blast parsers and records

Mon May 31 05:10:43 EDT 2010

On Sat, May 29, 2010 at 4:23 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Hi everybody,
>
> With Biopython 1.54 out (thanks Peter!), and NCBI encouraging the use of its new Blast+
> suite of Blast programs, maybe this is a good time to tackle some older bugs related
> to Blast output parsing in Biopython:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2176
> (inconsistencies in the output of different Blast parsers)
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2929
> (inconsistencies between Psi-blast parsers)
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2319
> (parsing Blast table output)
>
> and more generally think about the design of the Blast record class and Blast
> parsing. In my opinion, these are the major issues:
>
> 1) Blast parsers are located in several modules (Bio.Blast.NCBIXML,
> Bio.Blast.NCBIStandalone, Bio.Blast.ParseBlastTable). I think we should
> have one read() function and one parse() function under Bio.Blast, with
> arguments specifying which format the Blast output is in.

I see the point, but some of these parsers give very different output
(your points 2 and 3).
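
To make the proposal in (1) concrete, here is a minimal sketch of a
single parse() entry point that dispatches on a format argument. All
the names here (parse, _parse_tabular, _PARSERS) are hypothetical and
not existing Biopython API; the tabular parser reads only the first
three tab-separated columns, which I am assuming to be query id, hit id
and percentage identity.

```python
import io

# Hypothetical sketch only: none of these names exist in Biopython.


def _parse_tabular(handle):
    """Yield one dict per data line of tab-separated BLAST output.

    Assumes the first three columns are query id, hit id and
    percentage identity; comment lines starting with '#' are skipped.
    """
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        yield {
            "query": fields[0],
            "hit": fields[1],
            "identity": float(fields[2]),
        }


# Registry mapping a format name to its parser function. Other formats
# (XML, plain text) would be registered here in the same way.
_PARSERS = {"tabular": _parse_tabular}


def parse(handle, format):
    """Return an iterator over records for the named output format."""
    try:
        parser = _PARSERS[format]
    except KeyError:
        raise ValueError("Unknown BLAST output format: %r" % format)
    return parser(handle)


# Usage: parse two hits from an in-memory tabular file.
example = io.StringIO("# a comment line\nq1\ts1\t98.5\nq1\ts2\t75.0\n")
for row in parse(example, "tabular"):
    print(row["query"], row["hit"], row["identity"])
```

The dispatch itself is the easy part; the sketch does not address the
harder question of what a common record object should look like.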

> 2) Blast records produced by any of the parsers should be consistent
> with each other.

See also (3) below.

> As XML output by blast and psi-blast follows the same
> DTD, we should be able to represent both by a single Record class.

I think this was a short-term hack by the NCBI, and it rules out having
a single XML file hold multiple PSI-BLAST queries and their iterations.

> 3) Different parsers should store information in this Record class in
> the same way.

Where possible, yes, but different BLAST output formats contain
different information - e.g. some contain the hit sequences while
others do not.

> 4) The current Blast record stores its information in attributes. If you
> use Bio.Entrez to parse Blast XML output (Biopython 1.54 contains
> the necessary DTDs to do so), the information is stored in dictionaries.
> This has some advantages. For example, it allows you to use
> record.keys() to find out what the record contains. Ideally, I think
> that a Blast Record class should inherit from a dictionary.

As already pointed out, it has disadvantages too. With traditional
attributes or properties you can use dir(record) and also set up
docstrings for properties etc. I think they are clearer than dictionary
keys.

I would look at a base BLAST record (covering the core information
found in all formats including tabular) with subclasses for the richer
output formats (default plain text and XML).
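
As a rough illustration of the base class plus subclasses idea, here
is a minimal sketch. The class names (BaseRecord, XMLRecord) and the
attributes (query_id, hit_ids, hit_sequences, num_hits) are made up
for this example and are not a proposal for the final API.

```python
# Hypothetical sketch only: class and attribute names are illustrative.


class BaseRecord(object):
    """Core information assumed to be present in every output format."""

    def __init__(self, query_id, hit_ids):
        self.query_id = query_id
        self.hit_ids = list(hit_ids)

    @property
    def num_hits(self):
        """Number of hits reported for this query."""
        return len(self.hit_ids)


class XMLRecord(BaseRecord):
    """Richer record for formats that also carry the hit sequences."""

    def __init__(self, query_id, hit_ids, hit_sequences):
        BaseRecord.__init__(self, query_id, hit_ids)
        # Maps hit id -> sequence string; absent from tabular output.
        self.hit_sequences = dict(hit_sequences)


# Usage: attributes and properties show up in dir(), and a property
# can carry its own docstring.
record = XMLRecord("q1", ["s1", "s2"], {"s1": "MKV", "s2": "MRV"})
print("num_hits" in dir(record))
print(type(record).num_hits.__doc__)
```

Code written against BaseRecord would work with any parser, while code
that needs the hit sequences can check for the richer subclass.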

> 5) We should be able to print a Blast record object to generate
> output that is close to the plain-text output generated by blast.
> This would allow us to generate and store Blast output as XML,
> and to convert the output to plain-text to make it more human-
> readable.

Nice - but that could make the str(record) output very long.

> 6) The current Blast record inherits from Bio.Blast.Record.Header,
> Bio.Blast.Record.DatabaseReport, and Bio.Blast.Record.Parameters.
> I don't see the rationale for this inheritance, and I think we should
> remove it.

I agree this is a rather odd design choice (even if the three sections
did map onto three parts of the plain text output). We can probably
do this without changing the exposed Blast record behaviour.
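
One way this might be done is to hold the three sections as member
objects and delegate attribute reads to them. The following is a
minimal sketch under that assumption, not the actual Bio.Blast.Record
code: the three classes are cut-down stand-ins, the attribute names
are illustrative, and the member names (header, database_report,
parameters) are hypothetical.

```python
# Hypothetical sketch only: simplified stand-ins for the real classes.


class Header(object):
    def __init__(self):
        self.application = ""
        self.version = ""


class DatabaseReport(object):
    def __init__(self):
        self.database_name = []


class Parameters(object):
    def __init__(self):
        self.matrix = ""


class Record(object):
    """Holds the three sections instead of inheriting from them."""

    _PARTS = ("header", "database_report", "parameters")

    def __init__(self):
        self.header = Header()
        self.database_report = DatabaseReport()
        self.parameters = Parameters()

    def __getattr__(self, name):
        # Only called when normal lookup fails, so existing code such
        # as record.application keeps working. Going through __dict__
        # avoids infinite recursion if a section is not yet set.
        for part_name in self._PARTS:
            part = self.__dict__.get(part_name)
            if part is not None and hasattr(part, name):
                return getattr(part, name)
        raise AttributeError(name)


# Usage: old-style flat access still reads through to the section.
record = Record()
record.header.application = "BLASTP"
print(record.application)
```

Note this only covers reading. Assigning record.application = "..."
would create a new attribute on the record itself rather than updating
the header, so preserving write behaviour would need a matching
__setattr__ or explicit properties.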

> Any comments, suggestions (in particular about my proposal to
> have a Blast Record class that inherits from a dictionary)? Btw, to
> avoid breaking scripts, I propose that any changes to the Blast
> record and parser are implemented separately from the existing
> parsers and record, and to leave those untouched.

Some of these suggestions like (5) and (6) could be done to the
existing BLAST parsers and objects, and would seem a good idea.

Regarding the main proposal (1), I would be more interested in a
more ambitious proposal along the lines of BioPerl's SearchIO
covering not just BLAST but also FASTA, BLAT, HMMER and
any other "pairwise searches" (and potentially we could share code
for this with AlignIO for pairwise alignment formats).
This is more work of course, and could come later.
http://www.bioperl.org/wiki/HOWTO:SearchIO

Peter