[Biopython-dev] Blast parsers and records
Laurent Gautier
lgautier at gmail.com
Sat May 29 18:29:00 UTC 2010
Hi,
A few thoughts below:
On 5/29/10 6:00 PM, biopython-dev-request at lists.open-bio.org wrote:
> Hi everybody,
>
> With Biopython 1.54 out (thanks Peter!), and NCBI encouraging to use
> its new Blast+ suite of Blast programs, maybe this is a good time to
> tackle some older bugs related to Blast output parsing in Biopython:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 (inconsistencies in
> the output of different Blast parsers)
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2929 (inconsistencies
> between Psi-blast parsers)
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2319 (parsing Blast
> table output)
>
> and more generally think about the design of the Blast record class
> and Blast parsing. In my opinion, these are the major issues:
>
> 1) Blast parsers are located in several modules (Bio.Blast.NCBIXML,
> Bio.Blast.NCBIStandalone, Bio.Blast.ParseBlastTable). I think we
> should have one read() function and one parse() function under
> Bio.Blast, with arguments specifying which format the Blast output is
> in.
Having a factory function would be handy, but since the file formats
differ, having different classes to model them can be nice.
Modularity is good, and duck typing makes for an intuitive API.
What would you think of a design such as:
- a module/package 'Blast'
- an abstract class 'Output' defined in that module/package
- one concrete class per output format; each of those classes defines
the methods 'read()' and 'parse()' (read() and parse() would formally
be declared by an interface, and 'Output' would require their
implementation).
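
A minimal sketch of that layout, in present-day Python. Every name here
(Output, TabularOutput, FORMATS, open_output) and the three-column
tab-separated input are assumptions made for illustration; none of this is
existing Biopython API, and the toy parser only splits lines. In this
sketch only parse() is abstract, and read() is built on top of it.

```python
import io
from abc import ABC, abstractmethod


class Output(ABC):
    """Hypothetical abstract base: one subclass per Blast output format."""

    def __init__(self, handle):
        self.handle = handle

    @abstractmethod
    def parse(self):
        """Yield the records found in the handle, one at a time."""

    def read(self):
        """Return the only record; fail if there is none or more than one."""
        records = list(self.parse())
        if len(records) != 1:
            raise ValueError("expected exactly 1 record, found %d" % len(records))
        return records[0]


class TabularOutput(Output):
    """Toy parser for lines of the form 'query<TAB>hit<TAB>evalue'."""

    def parse(self):
        for line in self.handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            query, hit, evalue = line.split("\t")
            yield {"query": query, "hit": hit, "evalue": float(evalue)}


# Hypothetical factory: maps a format name to the class that models it.
FORMATS = {"tabular": TabularOutput}


def open_output(handle, fmt):
    try:
        return FORMATS[fmt](handle)
    except KeyError:
        raise ValueError("unknown format: %r" % fmt) from None


handle = io.StringIO("q1\ts1\t1e-5\nq1\ts2\t0.5\n")
records = list(open_output(handle, "tabular").parse())
```

Calling code only relies on the object having read() and parse(), which
is the duck-typing point: supporting a new format means adding one class
and one entry in the (hypothetical) FORMATS table.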
> 2) Blast records produced by any of the parsers should be consistent
> with each other. As XML output by blast and psi-blast follow the same
> DTD, we should be able to represent both by a single Record class.
Definitely the case for XML (blast/psi-blast). However, the various
formats (XML, others) may contain different levels of detail (I do not
know for sure, just considering the possibility here).
> 3) Different parsers should store information in this Record class in
> the same way.
I see two options:
- either the same Record class is returned by all parsers
or
- a hierarchy of classes with common accessors and methods whenever
possible (e.g., an abstract parent class (or interface) 'Blast.Record'
with child classes such as 'Blast.XMLRecord').
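
The second option could look roughly like the sketch below. The class
names follow the ones above, but the accessors (query, hits, best_hit),
the TableRecord class, and the shape of the input data are assumptions
chosen for illustration and do not reflect the real Blast XML layout.

```python
from abc import ABC, abstractmethod


class Record(ABC):
    """Hypothetical parent: accessors that every parser's record offers."""

    @property
    @abstractmethod
    def query(self):
        """Identifier of the query sequence."""

    @property
    @abstractmethod
    def hits(self):
        """List of (hit_id, evalue) tuples."""

    def best_hit(self):
        # Shared method, written only against the common accessors.
        return min(self.hits, key=lambda hit: hit[1])


class XMLRecord(Record):
    def __init__(self, tree):
        self._tree = tree  # nested dict standing in for parsed XML

    @property
    def query(self):
        return self._tree["query"]

    @property
    def hits(self):
        return [(h["id"], h["evalue"]) for h in self._tree["hits"]]


class TableRecord(Record):
    def __init__(self, rows):
        self._rows = rows  # list of (query, hit_id, evalue) tuples

    @property
    def query(self):
        return self._rows[0][0]

    @property
    def hits(self):
        return [(hit_id, evalue) for _, hit_id, evalue in self._rows]


xml_record = XMLRecord(
    {"query": "q1", "hits": [{"id": "s1", "evalue": 1e-5}, {"id": "s2", "evalue": 0.5}]}
)
table_record = TableRecord([("q1", "s1", 1e-5), ("q1", "s2", 0.5)])
```

Both records answer the same questions in the same way even though they
store their data differently, and a format that carries extra detail can
add accessors in its own subclass.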
> 4) The current Blast record stores its information in attributes. If
> you use Bio.Entrez to parse Blast XML output (Biopython 1.54 contains
> the necessary DTDs to do so), the information is stored in
> dictionaries. This has some advantages. For example, it allows you to
> use record.keys() to find out what the record contains. Ideally, I
> think that a Blast Record class should inherit from a dictionary.
Indeed. Attributes also have constraints regarding valid names that
dictionaries do not have.
Still, there is no need to require strict inheritance from Python's
dict; requiring the implementation of the interface (methods such as
__getitem__(), __iter__(), iteritems(), keys(), etc.) might as well
do it. I am thinking of the cost of conversion here: there might be times
when the only purpose is to loop through records and access only limited
information (in that case a custom class performing lazy access to the
information would be neat). Keeping it as an interface rather than
expecting direct inheritance gives more freedom to implement it,
while keeping compatibility with the rest of the code base.
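
As a rough illustration of the interface-plus-lazy-access idea, the
sketch below uses collections.abc.Mapping from present-day Python, which
supplies keys(), items(), get() and friends once __getitem__(),
__iter__() and __len__() are defined (iteritems() is the Python 2
spelling of items()). LazyRecord, its field names, and the conversion
rules are all hypothetical.

```python
from collections.abc import Mapping


class LazyRecord(Mapping):
    """Dict-like record that converts a field only when it is first read."""

    def __init__(self, raw_fields):
        self._raw = raw_fields  # field name -> unparsed string
        self._cache = {}        # field name -> converted value
        self.conversions = 0    # counter, only here to show the laziness

    def __getitem__(self, key):
        if key not in self._cache:
            raw = self._raw[key]  # raises KeyError for unknown fields
            self._cache[key] = self._convert(key, raw)
        return self._cache[key]

    def __iter__(self):
        return iter(self._raw)

    def __len__(self):
        return len(self._raw)

    def _convert(self, key, raw):
        self.conversions += 1
        if key == "evalue":
            return float(raw)
        if key == "hits":
            return raw.split(",")
        return raw


record = LazyRecord({"query": "q1", "evalue": "1e-5", "hits": "s1,s2"})
names = sorted(record.keys())  # listing the keys converts nothing
evalue = record["evalue"]      # converts this one field and caches it
```

The object is not a dict subclass, yet code that only uses the mapping
methods cannot tell the difference, and a loop that reads one field per
record never pays for converting the others.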
> 5) We should be able to print a Blast record object to generate
> output that is close to the plain-text output generated by blast.
> This would allow us to generate and store Blast output as XML, and to
> convert the output to plain-text to make it more human-readable.
>
> 6) The current Blast record inherits from Bio.Blast.Record.Header,
> Bio.Blast.Record.DatabaseReport, and Bio.Blast.Record.Parameters. I
> don't see the rationale for this inheritance, and I think we should
> remove it.
>
> Any comments, suggestions (in particular about my proposal to have a
> Blast Record class that inherits from a dictionary)? Btw, to avoid
> breaking scripts, I propose that any changes to the Blast record and
> parser are implemented separately from the existing parsers and
> record, and to leave those untouched.
>
> --Michiel.