[Biopython-dev] Blast parsers and records
Michiel de Hoon
mjldehoon at yahoo.com
Sat May 29 03:23:21 UTC 2010
Hi everybody,
With Biopython 1.54 out (thanks Peter!), and NCBI encouraging the use of its new Blast+ suite of Blast programs, maybe this is a good time to tackle some older bugs related to Blast output parsing in Biopython:
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
(inconsistencies in the output of different Blast parsers)
http://bugzilla.open-bio.org/show_bug.cgi?id=2929
(inconsistencies between Psi-blast parsers)
http://bugzilla.open-bio.org/show_bug.cgi?id=2319
(parsing Blast table output)
and, more generally, to think about the design of the Blast record class and Blast parsing. In my opinion, these are the major issues:
1) Blast parsers are located in several modules (Bio.Blast.NCBIXML, Bio.Blast.NCBIStandalone, Bio.Blast.ParseBlastTable). I think we should have one read() function and one parse() function under Bio.Blast, with arguments specifying which format the Blast output is in.
2) Blast records produced by any of the parsers should be consistent with each other. As the XML output of blast and psi-blast follows the same DTD, we should be able to represent both by a single Record class.
3) Different parsers should store information in this Record class in the same way.
4) The current Blast record stores its information in attributes. If you use Bio.Entrez to parse Blast XML output (Biopython 1.54 contains the necessary DTDs to do so), the information is stored in dictionaries. This has some advantages. For example, it allows you to use record.keys() to find out what the record contains. Ideally, I think that a Blast Record class should inherit from a dictionary.
5) We should be able to print a Blast record object to generate output that is close to the plain-text output generated by blast. This would allow us to generate and store Blast output as XML, and to convert the output to plain-text to make it more human-readable.
6) The current Blast record inherits from Bio.Blast.Record.Header, Bio.Blast.Record.DatabaseReport, and Bio.Blast.Record.Parameters. I don't see the rationale for this inheritance, and I think we should remove it.
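To make point 1 concrete, here is a minimal sketch of what a single parse()/read() pair with a format argument could look like. All names here (the _PARSERS registry, the "tabular" format string, the toy _parse_tabular function and its dictionary keys) are assumptions for illustration only; none of this exists in Biopython, and the toy parser simply treats each tab-separated line as one record.

```python
# Hypothetical sketch; none of these names exist in Biopython.

def _parse_tabular(handle):
    """Toy parser: yield one dict per non-comment, tab-separated line."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Simplification: only the first two columns are kept.
        yield {"query": fields[0], "target": fields[1]}


# Registry mapping a format name to a parser function (assumed design).
_PARSERS = {"tabular": _parse_tabular}


def parse(handle, format):
    """Iterate over the records in handle, using the parser for format."""
    try:
        parser = _PARSERS[format]
    except KeyError:
        raise ValueError("Unknown Blast output format: %r" % format)
    return parser(handle)


def read(handle, format):
    """Return the single record in handle; raise ValueError otherwise."""
    records = list(parse(handle, format))  # simple, not memory-efficient
    if len(records) != 1:
        raise ValueError("Expected exactly one record, found %d" % len(records))
    return records[0]
```

With such a layout, supporting another output format would only mean registering one more parser function, and user code would always call the same two functions.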
Any comments or suggestions (in particular about my proposal to have a Blast Record class that inherits from a dictionary)? Btw, to avoid breaking scripts, I propose that any changes to the Blast record and parsers be implemented separately from the existing parsers and record, leaving those untouched.
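As a rough illustration of points 4 and 5, a dictionary-based record that can also print itself might look like the sketch below. The key names ("query", "hits", "id", "evalue") and the output layout are made up for this example; they are not taken from the Blast DTD or from any existing Biopython class, and real Blast plain-text output is much richer.

```python
class Record(dict):
    """Hypothetical Blast record that stores its fields as dictionary items."""

    def __str__(self):
        # Render a short, human-readable summary of the record.
        lines = ["Query= %s" % self.get("query", "")]
        for hit in self.get("hits", []):
            lines.append("  %s  (E-value: %s)" % (hit["id"], hit["evalue"]))
        return "\n".join(lines)


# Example data, invented for illustration.
record = Record(query="my_query", hits=[{"id": "subject_1", "evalue": 1e-20}])
print(sorted(record.keys()))  # inspect what the record contains
print(record)                 # plain-text rendering via __str__
```

Because the class is a plain dict subclass, record.keys() and the other dictionary methods work without any extra code.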
--Michiel.