[Biopython-dev] Working with the new SearchIO API

Wibowo Arindrarto w.arindrarto at gmail.com
Mon Oct 29 22:55:19 UTC 2012


Hi Kai,

Thanks for the input & comments! I made the API change mainly because
I want to keep the SearchIO object hierarchy more consistent, i.e.
there should be as few places as possible to make changes that break
the model.

There are several attributes that should remain the same between a
single QueryResult object and the Hits, HSPs, and HSPFragments it
contains. For now, these attributes are the ID (both query and hit ID)
and description (also for both query and hit). In the old API, each
object in the object model hierarchy stores these values as its own
attribute. For example, to store the ID of the Hit object, the old API
has the 'id' attribute in the Hit object, the 'hit_id' attribute in all
HSP objects it contains, and the 'hit_id' attribute in all HSPFragment
objects contained by each HSP in the Hit. I see this as unnecessary
duplication and a possible source of confusion, since these
attributes are completely decoupled from one another even though they
mean the same thing.

The new API stores these values only at the innermost object in
the hierarchy (the HSPFragment), reducing duplication and possible
sources of inconsistency. When you access the attributes from
objects other than the HSPFragment, a getter retrieves the value from
one of the contained HSPFragment objects, after ensuring that all
HSPFragment objects contain the same value of the attribute
(https://github.com/bow/biopython/blob/searchio/Bio/SearchIO/_utils.py#L99).
Similarly, when you set the attribute, a setter applies the new value
to all contained HSPFragment objects
(https://github.com/bow/biopython/blob/searchio/Bio/SearchIO/_utils.py#L106).

This allows you to keep the values consistent across the hierarchy, so
long as the change is done at the highest level possible (e.g.
changing the hit ID in the HSP object will break consistency, but
changing hit ID through the Hit object will update the hit_id
attribute value across all HSPs it contains). Conceptually, this is
also closer to the real 'Hit' object we're modeling since we always
need at least one HSP to declare a database entry as a Hit.
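
A minimal sketch of the getter/setter idea described above. This is not
the code in _utils.py: the helper name cascade() and the stripped-down
Fragment, HSP, and Hit classes are hypothetical stand-ins, and raising
ValueError on a mismatch is an assumption made for illustration.

```python
def cascade(attr):
    """Build a property that proxies `attr` to all contained items.

    Hypothetical helper for illustration, not the SearchIO implementation.
    """
    def getter(self):
        # Read the value from every contained item; they must all agree.
        values = {getattr(item, attr) for item in self._items}
        if len(values) != 1:
            raise ValueError(
                "inconsistent {} values: {}".format(attr, values))
        return values.pop()

    def setter(self, value):
        # Push the new value down to every contained item.
        for item in self._items:
            setattr(item, attr, value)

    return property(getter, setter)


class Fragment:
    """Innermost object: the only place where hit_id is actually stored."""
    def __init__(self, hit_id):
        self.hit_id = hit_id


class HSP:
    hit_id = cascade("hit_id")

    def __init__(self, fragments):
        if not fragments:
            raise ValueError("an HSP needs at least one fragment")
        self._items = list(fragments)


class Hit:
    # Hit.id reads HSP.hit_id, which in turn reads Fragment.hit_id.
    id = cascade("hit_id")

    def __init__(self, hsps):
        if not hsps:
            raise ValueError("a Hit needs at least one HSP")
        self._items = list(hsps)


hsp_a = HSP([Fragment("hit1"), Fragment("hit1")])
hsp_b = HSP([Fragment("hit1")])
hit = Hit([hsp_a, hsp_b])

hit.id = "renamed"      # set at the Hit level: reaches every fragment
hsp_a.hit_id = "other"  # set at the HSP level: hsp_b is left behind
# Reading hit.id now raises ValueError in this sketch, because the two
# HSPs no longer agree.
```

Setting the value through the Hit keeps everything in sync, while
setting it through a single HSP leaves its siblings untouched, which is
the inconsistency mentioned above.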

The HMMER parser's update is partially influenced by this API change,
as you've seen. In the previous version
(https://github.com/bow/biopython/blob/12fbe05c5e17f7a356ab672358b2698612aa8cad/Bio/SearchIO/HmmerIO/hmmertext.py),
the HMMER parser had several ugly bits (e.g. it set the hit
description in more than one place, a possible source of error). After
changing the API to force the creation of Hits with HSPs, these kinds
of duplication are eliminated. I personally also feel that using the
new API allows me (sometimes forces me) to improve the other formats'
parsers in a similar way.

It's unfortunate that the HMMER text parser is made a little difficult
to understand, due to the way HMMER arranges the text output format.
And I admit I didn't do any performance benchmark for the HMMER text
parser when I made the change; I suspected one extra dictionary per
Hit object should not decrease performance that much. Of course, if
the change proves to cause severe performance penalties, then yes, we
should look into it again.
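
A rough sketch of the "one extra dictionary per Hit" pattern, for
reference. It assumes the hit table and the HSP table have already been
split into simple tuples; build_hits(), hit_rows, and hsp_rows are
hypothetical names, and the real hmmer3_text.py parser is more involved
than this.

```python
from collections import OrderedDict


class Hit:
    """Stripped-down stand-in: refuses to exist without HSPs."""
    def __init__(self, hit_id, description, hsps):
        if not hsps:
            raise ValueError("a Hit needs at least one HSP")
        self.id = hit_id
        self.description = description
        self.hsps = list(hsps)


def build_hits(hit_rows, hsp_rows):
    """Create Hit objects only once their HSPs are known.

    hit_rows: iterable of (hit_id, description), in file order
    hsp_rows: iterable of (hit_id, hsp), read later in the file
    """
    # First pass: the hit table comes first in the file, so keep each
    # row as a plain dict placeholder instead of a Hit object.
    pending = OrderedDict()
    for hit_id, description in hit_rows:
        pending[hit_id] = {"description": description, "hsps": []}

    # Second pass: attach every HSP to its placeholder.
    for hit_id, hsp in hsp_rows:
        pending[hit_id]["hsps"].append(hsp)

    # Only now are the real Hit objects created, in hit-table order.
    return [Hit(hit_id, attrs["description"], attrs["hsps"])
            for hit_id, attrs in pending.items()]


hits = build_hits(
    [("hitA", "first hit"), ("hitB", "second hit")],
    [("hitA", "hsp1"), ("hitB", "hsp2"), ("hitA", "hsp3")],
)
```

The placeholder costs one dictionary per hit and is discarded as soon as
the real Hit is built. How to treat a hit row that never receives an HSP
is left open here; in this sketch the Hit constructor raises ValueError.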

But for now, I think these are acceptable tradeoffs, if it means the
object model becomes more consistent and the other format parsers
are improved as well.

Hope that helps :).

regards,
Bow

P.S. As for the misleading part, yes, I admit that maybe a different
name should be used to note that the contents of the list differ.


On Mon, Oct 29, 2012 at 9:43 PM, Kai Blin
<kai.blin at biotech.uni-tuebingen.de> wrote:
> Hi Bow,
>
> I've been looking closer at the SearchIO API changes introduced in
> August. I think there still is a design problem with the object model,
> at least when looking at how this affects the hmmer3 parser (and affects
> the hmmer2 parsing as well).
>
> Possibly I'm not seeing the big picture here, so let me explain what I'm
> seeing, and then you can tell me what I missed. :)
>
> So, the hmmer2 and hmmer3 file format basically looks like this
>
> # header
> # ...
> # ...
>
> information about the query
>
> list of hits
>
> list of hsps
>
> (alignments for hsps)
>
> (some statistics)
> //
>
> Now, when parsing this file line-wise, you obviously run into the hits
> first. However, with the new API, you can't create a Hit object without
> knowing the HSPs, but you haven't read them yet.
>
> To work around this, you need to create a fake hit object
> (https://github.com/bow/biopython/blob/searchio/Bio/SearchIO/HmmerIO/hmmer3_text.py#L201).
> Then, in the loop that creates the fake hit objects, one of the exit
> conditions parses the HSP entries and replaces the fake hit
> objects with "real" Hit objects.
> (https://github.com/bow/biopython/blob/searchio/Bio/SearchIO/HmmerIO/hmmer3_text.py#L188)
>
> By the way, that code is a bit misleading. Took me a while to notice the
> switch of the list's contents. Anyway, back to business.
>
> So basically you need to create two hit objects for every hit you're
> looking at. What's the advantage of forcing Hsp objects to be passed to
> the Hit constructor? Just to make sure your Hit objects have a valid Hsp
> at some later point?
>
> I'm aware that I'm just looking at the SearchIO design from the
> perspective of the hmmer2 parser, but I'd like to understand the reasons
> for the API being the way it currently is.
>
> Hope you can shed some light on this,
> Kai
>
> --
> Dipl.-Inform. Kai Blin         kai.blin at biotech.uni-tuebingen.de
> Institute for Microbiology and Infection Medicine
> Division of Microbiology/Biotechnology
> Eberhard-Karls-University of Tübingen
> Auf der Morgenstelle 28                 Phone : ++49 7071 29-78841
> D-72076 Tübingen                        Fax :   ++49 7071 29-5979
> Deutschland
> Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben



