[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Sun Sep 16 19:17:12 UTC 2012

On Sun, Sep 16, 2012 at 5:24 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> Also I noticed that SearchIO (like Bio.Blast) uses attributes
> to store information. I would much rather see a dictionary-like
> interface. This has the advantage that we can keep the key
> name much closer to what is in the original file (for example,
> no need to replace '-' by '_'), and also users can call .keys()
> to find out what is stored in the object.

I don't see a dictionary as being inherently easier to use.
You can also use dir(obj) to see the attributes, which are
more flexible as you can implement them as properties and
have code behind them if needed. Another key point is
we can add docstrings to attributes/properties to give
help text - and you can't do that with a dictionary key.
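
To make that concrete, here is a minimal sketch. The Hit class
and its bit_score property are hypothetical names made up for
this example, not the actual SearchIO API:

```python
class Hit:
    """Hypothetical search hit, for illustration only (not the SearchIO API)."""

    def __init__(self, raw):
        # raw: values keyed by the names used in the original file
        self._raw = dict(raw)

    @property
    def bit_score(self):
        """Bit score of the hit, as a float."""
        # Code behind the attribute: convert the stored string on access.
        return float(self._raw["bit-score"])


hit = Hit({"bit-score": "52.8"})

# dir() lists the attributes, much as .keys() would list dictionary keys.
public = [name for name in dir(hit) if not name.startswith("_")]

# The help text lives on the property itself.
doc = Hit.bit_score.__doc__
```

Calling help(Hit) would show the same docstring, which has no
equivalent for a dictionary key.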

Also different file formats use different terms for what
is really the same idea - I envisioned SearchIO as a
unified parser, which means imposing a common
naming convention for these key fields.

I also think that certain core bits of information common
to BLAST, HMMER, etc should be exposed at the property
level (including query match names and co-ordinates).
Here we're going to standardise start/end values to integers
using Python counting (zero-based, end-exclusive), consistent
strand notation, etc.
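
As a rough sketch of the kind of normalisation meant here,
assuming the input uses one-based inclusive positions (as in
BLAST output) and that a reversed pair means the minus strand.
The function name and the +1/-1 strand values are assumptions
for this example, not settled SearchIO conventions:

```python
def to_python_coords(first, last):
    """Convert one-based inclusive positions to Python-style (start, end, strand).

    Hypothetical helper for illustration. Assumes a reversed pair
    (first > last) indicates a match on the minus strand.
    """
    first, last = int(first), int(last)
    strand = 1 if first <= last else -1
    start = min(first, last) - 1  # zero-based start
    end = max(first, last)        # end-exclusive, so sequence[start:end] works
    return start, end, strand


plus = to_python_coords("3", "5")
minus = to_python_coords(5, 3)
matched = "ABCDEFGHIJ"[plus[0]:plus[1]]
```

With values in this form, end - start gives the match length
and the pair can be used directly as a slice.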

As in the SeqRecord and SeqFeature, a dictionary makes
perfect sense for general 'free form' information. And this
approach is used here too.
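
A small sketch of that split, with common fields as attributes
and everything else in a dictionary. The HitRecord class, the
identifier and the values are made up for this example; only
the two key names are taken from BLAST XML tags:

```python
class HitRecord:
    """Hypothetical container for illustration; not a Biopython class."""

    def __init__(self, hit_id, start, end, annotations=None):
        # Core fields shared across formats are plain attributes.
        self.id = hit_id
        self.start = start
        self.end = end
        # Format-specific extras keep their original key names.
        self.annotations = dict(annotations or {})


record = HitRecord(
    "example_hit",  # made-up identifier
    9,
    25,
    {"Hsp_bit-score": "52.8", "Hsp_evalue": "1e-10"},
)

# The free-form part still supports .keys(), with no renaming of '-' needed.
extra_keys = sorted(record.annotations.keys())
```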

Regards,

Peter