[Biopython-dev] Working with the new SearchIO API

Tue Oct 30 07:35:40 UTC 2012

On 2012-10-29 23:55, Wibowo Arindrarto wrote:

Hi Bow,

> Thanks for the input & comments! I made the API change mainly because
> I want to keep the SearchIO object hierarchy more consistent, i.e.
> there should be as few places as possible to make changes that break
> the model.

Thanks for the explanation.

...

> This allows you to keep the values consistent across the hierarchy, so
> long as the change is done at the highest level possible (e.g.
> changing the hit ID in the HSP object will break consistency, but
> changing hit ID through the Hit object will update the hit_id
> attribute value across all HSPs it contains). Conceptually, this is
> also closer to the real 'Hit' object we're modeling since we always
> need at least one HSP to declare a database entry as a Hit.

I see. I didn't think about the programmatic side of things. I see the
advantage of having only one attribute there and of keeping it consistent.
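
To make the consistency argument concrete, here is a minimal sketch of how an ID change made at the Hit level could be pushed down to every contained HSP. The classes `ToyHSP` and `ToyHit` are hypothetical stand-ins written for this illustration. They are not the actual Bio.SearchIO classes, and the real implementation may well differ.

```python
class ToyHSP:
    """Hypothetical stand-in for an HSP; only carries the hit ID."""

    def __init__(self, hit_id):
        self.hit_id = hit_id


class ToyHit:
    """Hypothetical stand-in for a Hit that must contain at least one HSP."""

    def __init__(self, hsps):
        hsps = list(hsps)
        if not hsps:
            raise ValueError("a Hit needs at least one HSP")
        # The hit ID is taken from the HSPs, so it is stored in one place only.
        self._id = hsps[0].hit_id
        for hsp in hsps:
            if hsp.hit_id != self._id:
                raise ValueError("all HSPs must share the same hit ID")
        self._hsps = hsps

    @property
    def hsps(self):
        return tuple(self._hsps)

    @property
    def id(self):
        return self._id

    @id.setter
    def id(self, value):
        # Changing the ID at the Hit level updates every contained HSP.
        self._id = value
        for hsp in self._hsps:
            hsp.hit_id = value
```

Setting `hit.id` keeps all HSPs in sync, whereas assigning to `hsp.hit_id` directly would bypass the Hit and break consistency, which is the case described in the quoted paragraph.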

> The HMMER parser's update is partially influenced by this API change,
> as you've seen. In the previous version
> (https://github.com/bow/biopython/blob/12fbe05c5e17f7a356ab672358b2698612aa8cad/Bio/SearchIO/HmmerIO/hmmertext.py),
> the HMMER parser has several ugly bits (e.g. it sets the hit
> description in more than one place, a possible source of error). After
> changing the API to force the creation of Hits with HSPs, these kinds
> of duplications are eliminated. I personally also feel that using the
> new API allows me (sometimes forces me) to improve the other formats'
> parsers in a similar way.

Arguably, the more human-readable the file you need to parse, the less
readable the parser tends to be. ;) I think the old parser was a more
straightforward piece of code.

> It's unfortunate that the HMMER text parser is made a little difficult
> to understand, due to the way HMMER arranges the text output format.
> And I admit I didn't do any performance benchmark for the HMMER text
> parser when I made the change (I suspected one extra dictionary per
> Hit object would not decrease performance that much). Of course, if
> the change proves to cause severe performance penalties, then yes, we
> should look into it again.

I'm not talking about performance here; performance likely isn't a
problem. I'm saying that you're conceptually creating the Hit object
twice. Even the comment on line 200 says so. :)

[snip]
            # create the hit object
            hit_attrs = {
                'id': row[8],
                'query_id': qid,
                'evalue': float(row[0]),
                'bitscore': float(row[1]),
                'bias': float(row[2]),
                # row[3:6] is not parsed, since the info is available
                # at the HSP level
                'domain_exp_num': float(row[6]),
                'domain_obs_num': int(row[7]),
                'description': row[9],
                'is_included': is_included,
            }
            hit_list.append(hit_attrs)
[snip]

I'm mainly wondering why I can't just create the Hit object at this
point and set the HSPs later. You could do this via a setter function
that validates that the IDs are identical, if you want to make sure
you're not shooting yourself in the foot there.
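
A minimal sketch of that setter idea, assuming hypothetical classes `EarlyHit` and `DomainHSP` that are not part of the actual SearchIO API: the Hit is created as soon as the hit table row is parsed, and the HSPs are attached later through a property that rejects mismatched IDs.

```python
class DomainHSP:
    """Hypothetical HSP stand-in carrying the ID of the hit it belongs to."""

    def __init__(self, hit_id, evalue):
        self.hit_id = hit_id
        self.evalue = evalue


class EarlyHit:
    """Hypothetical Hit stand-in that can exist before its HSPs are known."""

    def __init__(self, id, description=""):
        self.id = id
        self.description = description
        self._hsps = []

    @property
    def hsps(self):
        return list(self._hsps)

    @hsps.setter
    def hsps(self, new_hsps):
        new_hsps = list(new_hsps)
        # Refuse HSPs that were parsed for a different hit.
        for hsp in new_hsps:
            if hsp.hit_id != self.id:
                raise ValueError(
                    "HSP belongs to hit %r, expected %r" % (hsp.hit_id, self.id)
                )
        self._hsps = new_hsps
```

With this, `hit = EarlyHit(row[8], description=row[9])` could be built while reading the hit table, and `hit.hsps = parsed_hsps` would be called once the domain section has been read. The trade-off is that such a Hit can temporarily exist without any HSP, which is what the new API is trying to rule out.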

> But for now, I think these are acceptable tradeoffs, if it means the
> object model becomes more consistent and the other format parsers
> improved as well.

I haven't looked into the other parsers, so I'll take your word on that.
I can of course take the same detour of creating a placeholder hit
object for the first pass and then, once I've parsed the HSPs, creating
the real Hit object. If this makes all the other parsers more readable
at the cost of some obscurity in the HMMER text parser, well, so be it.
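
For comparison, the placeholder detour can be sketched roughly as follows. The function `assemble_hits` and the plain dictionaries used for rows are assumptions made for this illustration only. The real parser builds actual Hit and HSP objects, and skipping hits without HSPs is likewise an assumption of this sketch.

```python
def assemble_hits(hit_rows, hsp_rows):
    """Two-pass assembly: placeholders first, real hits once HSPs are known.

    hit_rows -- list of attribute dicts from the hit table, each with an 'id'
    hsp_rows -- list of attribute dicts for HSPs, each with a 'hit_id'
    """
    # First pass: keep only the attributes, keyed by hit ID (order preserved).
    pending = {row["id"]: {"attrs": row, "hsps": []} for row in hit_rows}

    # Second pass: attach every parsed HSP to its placeholder.
    for hsp in hsp_rows:
        pending[hsp["hit_id"]]["hsps"].append(hsp)

    # Only now build the "real" hits, each with at least one HSP.
    hits = []
    for entry in pending.values():
        if not entry["hsps"]:
            continue  # assumption: a hit without HSPs is dropped
        hits.append(dict(entry["attrs"], hsps=entry["hsps"]))
    return hits
```

The placeholder dictionary plays the role of `hit_attrs` in the excerpt above: it holds the hit-level values until the HSPs exist.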

Cheers,
Kai

-- 
Dipl.-Inform. Kai Blin         kai.blin at biotech.uni-tuebingen.de
Institute for Microbiology and Infection Medicine
Division of Microbiology/Biotechnology
Eberhard-Karls-University of Tübingen
Auf der Morgenstelle 28                 Phone : ++49 7071 29-78841
D-72076 Tübingen                        Fax :   ++49 7071 29-5979
Deutschland
Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben