[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Sun Sep 16 17:44:23 UTC 2012

Hi Michiel,

> Thanks for the links! This is actually the first time I looked at the SearchIO module in detail.

You're welcome :).

> I noticed that there is a large overlap between the functionality in Bio.Blast and the SearchIO module. We should definitely avoid having two sets of Blast parsers; as the recent discussion shows, one set of Blast parsers is hard enough already.
>
> So I would strongly suggest to integrate the SearchIO module with Bio.Blast. Here "integrate" could mean as little as using the Bio.Blast name space, and making sure we don't lose any functionality. (Or we could pick a better name than Bio.Blast, since SearchIO also includes blat, exonerate, etc.; but since Blast is the most important one perhaps using Bio.Blast for all of them is OK). The final outcome would then be that the parsers currently in SearchIO will replace the parsers currently in Bio.Blast.

The plan that Peter and I discussed was indeed to eventually deprecate
Bio.Blast in favor of SearchIO. I prefer not to use Bio.Blast as the
name, precisely for the reason you mentioned: SearchIO also covers
BLAT, Exonerate, etc. When we last discussed it, the idea was to use
Bio.Seq.Search as the name (or bio.seq.search, once we settle on the
namespace).

Also, the bio.seq.search (or whatever we will call it) module will
have wrappers for sequence search command line and web tools. Of
course, this won't be for BLAST only. In another branch, I've written
a draft HMMER wrapper and a partial BLAT wrapper. For the web tool,
the HMMER devs also have a web service for which we could create a
wrapper.

> Also I noticed that SearchIO (like Bio.Blast) uses attributes to store information. I would much rather see a dictionary-like interface. This has the advantage that we can keep the key name much closer to what is in the original file (for example, no need to replace '-' by '_'), and also users can call .keys() to find out what is stored in the object.
>
> Best,
> -Michiel.

If you are talking about using the subscript (square-bracket) notation
to retrieve object attributes, that could be difficult for users. Most
of the current SearchIO objects are themselves containers of other
objects (the object model is nested), and the square brackets are
already used to retrieve those nested objects. I could try
implementing some hacks so that the attributes are stored in a
dictionary, but I think this would confuse users when they use the
brackets (am I retrieving an attribute or a nested SearchIO object?).
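
To illustrate the ambiguity, here is a minimal sketch. The Hit and
QueryResult classes below are simplified, hypothetical stand-ins that I
made up for this example, not the actual SearchIO classes:

```python
# Hypothetical stand-ins for illustration only, not the real SearchIO classes.
class Hit:
    def __init__(self, hit_id, evalue):
        self.id = hit_id
        self.evalue = evalue


class QueryResult:
    """A container of Hit objects that also has attributes of its own."""

    def __init__(self, query_id, hits):
        self.id = query_id
        self._hits = {hit.id: hit for hit in hits}

    def __getitem__(self, key):
        # Square brackets already mean "return the nested Hit with this ID".
        return self._hits[key]


qresult = QueryResult("query1", [Hit("hit1", 1e-20), Hit("id", 0.5)])

nested = qresult["hit1"]  # a nested Hit object
attr = qresult.id         # an attribute of the container itself

# If attributes were also looked up with [], this call would be ambiguous:
# the container's own ID, or the Hit whose ID happens to be "id"?
clash = qresult["id"]
```

With attribute access and item access kept separate, qresult.id and
qresult["id"] never compete for the same key.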

Maybe what you have in mind is a single dictionary stored as an object
attribute as the interface? For example, we could have object.attribs
as the dictionary and use object.attribs['e-value'] to retrieve a
value. We do gain '-' instead of '_' and `.keys()` this way, but at
the cost of brevity, so I have mixed feelings about this.
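
A rough sketch of what that interface could look like, assuming a
made-up Hit class with an attribs dictionary (none of this exists in
SearchIO as it stands):

```python
# Hypothetical sketch of the "single dictionary attribute" idea.
class Hit:
    def __init__(self, hit_id, **attribs):
        self.id = hit_id
        # Keys can stay exactly as they appear in the original file,
        # including characters such as '-' that are not valid in
        # Python identifiers.
        self.attribs = dict(attribs)


hit = Hit("hit1", **{"e-value": 1e-20, "bit-score": 98.5})

evalue = hit.attribs["e-value"]     # longer than hit.evalue
names = sorted(hit.attribs.keys())  # discover what is stored
```

The trade-off is visible in the last two lines: the keys are easy to
list and match the file, but every lookup is longer to type than plain
attribute access.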

If users want to find out what the attributes are, they can use
object.__dict__.keys(). I could try creating a common property (e.g.
object.attrib_names) that returns a list of all available attribute
names for a given object. But for now, this seems a bit excessive to
me (it could be done if more people want it, though).
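
For reference, here is how the two options compare on a toy class. The
attrib_names property is only the idea mentioned above, and the Hit
class and the choice to hide underscore-prefixed names are my own
assumptions for this sketch:

```python
# Hypothetical sketch; attrib_names is not implemented in SearchIO.
class Hit:
    def __init__(self, hit_id, evalue, bitscore):
        self.id = hit_id
        self.evalue = evalue
        self.bitscore = bitscore
        self._items = []  # internal storage, not meant for users

    @property
    def attrib_names(self):
        # Assumption: names starting with '_' are treated as private
        # and left out of the list.
        return sorted(n for n in self.__dict__ if not n.startswith("_"))


hit = Hit("hit1", 1e-20, 98.5)

raw = sorted(hit.__dict__.keys())  # includes the private '_items'
public = hit.attrib_names          # public attribute names only
```

Note that __dict__ only lists instance attributes, so anything defined
as a property on the class would show up in neither list.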

Thanks for taking a look, by the way. I always appreciate a fresh
perspective :).

regards,
Bow