[Biopython-dev] Need some help with SearchIO HSPs cascading attributes.
Wibowo Arindrarto
w.arindrarto at gmail.com
Wed Dec 5 11:39:13 EST 2012
Hi Kai and everyone,
Very happy to see the parser near completion (with tests too!). The
issue you're facing is unfortunately the consequence of trying to keep
attribute values in sync across the object hierarchy. It is a bit
troublesome for now, but not without a solution.
> However, no matter what I do, I seem to get an <unknown description>
> tossed in there somehow.
>
> The parser is at
> https://github.com/kblin/biopython/blob/antismash/Bio/SearchIO/HmmerIO/hmmer2_text.py
> the test code is at
> https://github.com/kblin/biopython/blob/antismash/Tests/test_SearchIO_hmmer2_text.py
> and the test file that's failing is the hmmpfam2.3 file at
> https://github.com/kblin/biopython/blob/antismash/Tests/Hmmer/text_23_hmmpfam_001.out
'<unknown description>' is the default value for any description
attribute (be it in the QueryResult object, or in
HSPFragment.hit_description). You're seeing the error because the
hit description is being accessed through the hit object
(hit.description), and the cascading property getter first checks whether
all the HSPs contain the same `hit_description` attribute value. It'll only
return the value if all HSPFragment.hit_description values are equal.
Otherwise, it'll raise the error you're seeing here.
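
To make the getter behaviour concrete, here is a minimal sketch using toy
classes. It is not the actual SearchIO code: ToyFragment, ToyHit and
DEFAULT_DESCRIPTION are hypothetical names made up for illustration, and the
error message is invented. It only shows the idea of returning a value when
all contained fragments agree and raising otherwise.

```python
DEFAULT_DESCRIPTION = "<unknown description>"  # assumed placeholder value


class ToyFragment:
    """Hypothetical stand-in for an HSPFragment."""

    def __init__(self, hit_description=DEFAULT_DESCRIPTION):
        self.hit_description = hit_description


class ToyHit:
    """Hypothetical stand-in for a Hit holding several fragments."""

    def __init__(self, fragments):
        self.fragments = list(fragments)

    @property
    def description(self):
        # Collect the distinct values stored in the contained fragments.
        values = {frag.hit_description for frag in self.fragments}
        if len(values) != 1:
            # The fragments disagree (or there are none), so refuse to guess.
            raise ValueError(
                "fragments disagree on hit_description: %r" % sorted(values)
            )
        return values.pop()


consistent = ToyHit([ToyFragment("Conserved region"), ToyFragment("Conserved region")])
print(consistent.description)

# One fragment never got its description assigned, so it keeps the default.
mixed = ToyHit([ToyFragment("Conserved region"), ToyFragment()])
try:
    mixed.description
except ValueError as err:
    print("error:", err)
```

The second case mirrors the situation described here: one fragment carries the
real description and another still carries the default.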
In your case, there are two values: 'Conserved region in glutamate
synthas' and '<unknown description>', while there should only be one
(the first one). After prodding here and there, it seems that this is
caused by the if clause here:
https://github.com/kblin/biopython/blob/antismash/Bio/SearchIO/HmmerIO/hmmer2_text.py#L191
The 'else' clause in that block adds the HSP to the hit object, but
does not do any cascading attribute assignment (query_description and
hit_description).
Here, the simple fix would be to force a description assignment to the
HSP. For example, you could have the `else` block like so:
    ...
    else:
        hit = unordered_hits[id_]
        hsp.hit_description = hit.description
        hit.append(hsp)
Other fixes are of course possible, but this is the simplest I can
imagine (though it seems a bit crude).
Also, I would like to note that the query description assignment of
the parser may break the cascade as well. If you try to access
`qresult.description` (qresult being the QueryResult object), you'd
get the true query description. But if you try to access it from
`qresult[0].query_description` (the query description stored in the
hit object), you'd get '<unknown description>'. The fix here would be
to assign the description at the last moment before the QueryResult
object is yielded. That way, the cascading setter works properly and
all Hit, HSP, and HSPFragment inside the QueryResult object will
contain the same value.
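
Why the order of assignment matters can be sketched with toy classes. Again,
these are not the real SearchIO objects: ToyQueryResult, ToyHit and
ToyFragment are hypothetical names, and the sketch assumes that the setter
pushes the value down to whatever is contained at the time of assignment,
while appending a hit later does not cascade anything.

```python
DEFAULT_DESCRIPTION = "<unknown description>"  # assumed placeholder value


class ToyFragment:
    """Hypothetical stand-in for an HSPFragment."""

    def __init__(self):
        self.query_description = DEFAULT_DESCRIPTION


class ToyHit:
    """Hypothetical stand-in for a Hit holding several fragments."""

    def __init__(self, fragments):
        self.fragments = list(fragments)


class ToyQueryResult:
    """Hypothetical stand-in for a QueryResult with a cascading setter."""

    def __init__(self):
        self._description = DEFAULT_DESCRIPTION
        self.hits = []

    def append(self, hit):
        # Assumption: appending does not cascade any attribute.
        self.hits.append(hit)

    @property
    def description(self):
        return self._description

    @description.setter
    def description(self, value):
        self._description = value
        # Push the value down to everything contained right now.
        for hit in self.hits:
            for frag in hit.fragments:
                frag.query_description = value


# Description set first: hits appended afterwards never receive it.
early = ToyQueryResult()
early.description = "my query"
early.append(ToyHit([ToyFragment()]))
print(early.hits[0].fragments[0].query_description)

# Description set last, just before yielding: every fragment receives it.
late = ToyQueryResult()
late.append(ToyHit([ToyFragment()]))
late.description = "my query"
print(late.hits[0].fragments[0].query_description)
```

In the sketch, only the second ordering leaves the top-level object and its
fragments in agreement.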
I realize that this approach is not without flaws (and I'm always open to
suggestions), but at the moment this seems to be the most sensible way
to keep the attribute values in sync while keeping the objects
user-friendly (i.e. making the parser slightly more complex to write, but
giving users consistent attribute values).
Hope this helps!
Bow