[Bioperl-l] Bio::SearchIO::hmmer hsp behaviour

Thu Jun 29 05:23:03 UTC 2006

...

> I think the bug fix you refer to had to do with not returning things
> ordered by E-value -- the creation machinery only only builds Hit
> objects when there are HSP objects being built.  Basically the
> parsing is linear in terms of the file, we read "Model" (Hit) data
> first and store them in a hash keyed by the name of the domain, but
> we only >>build<< the "Hits" when seen HSPs, hence the problem when
> the -A option limits alignments but reports Hits that don't have
> individual alignments.  This has to do with the order of things not
> syncing up and/or dealing with the -A option when there is leftover
> Hit data but no HSPs to populate them.  We also had this problem in
> BLAST reports and had to work around that, but I never bothered
> solving it in HMMER I guess.  Glad there are other people who are
> going to fix the problems!

Yeah, just figured that one out.  I see the two tables are parsed into two
arrays, so it is feasible to have the leftover (Hit/HSP|Model/Domain)
whatever converted into the proper objects like without any alignments (-A0
optional output).  I plan on reporting this in Bugzilla and will work on it,
but can't get to it immediately (probably not 'til Friday-Saturday at the
earliest).  If Sendu wants to tackle it I don't have a problem.

> The one "alignment" (HSP) per hit was a workaround to the problem
> that Hits were being returned in the order the HSPs came in (Sequence
> order) -- because that is the order they were being built in -- not
> in the sorted order of the Hits as seen in the report.

The SAX method, I gather, getting in the way.  

> Feel free to propose an alternative implement for parser as you see
> fit as long as the API is preserved.  you can contibute a new
> SearchIO plugin and HMMERSearchResultListener to deal with it - or I
> guess do what I also do and just run hmmer2table and deal with things
> in a tab-delimited format.

Or set it up as hashes, which you have mentioned before for BLAST.

> Personally my interests lie in the actual domains so the Hit objects
> are superfluous in my own work so it never bothered me to have one
> per Hit and it flows more naturally to things like GFF, etc.  You can
> aggregate them however you like after the fact pretty simply so I
> don't find this too hard to deal with, but if this a major deterrent
> for people I guess have at it ( I think the speed of object creation
> is a larger problem that I hope that someone will work on soon).

Agreed, though now it's finding the time....

Chris 

> I'd appreciate you including the salient points of how the report is
> interpreted on the wiki at some point (with 8X10 glossy pictures and
> circles and arrows on the back...http://en.wikipedia.org/wiki/Alice%
> 27s_Restaurant) so the debate can be archived too.
> 
> -jason
> 
> On Jun 28, 2006, at 7:00 PM, Sendu Bala wrote:
> 
> > Chris Fields wrote:
> >>> Sendu Bala wrote:
> > [snip]
> >>> In any case, this is extremely counter-intuitive, especially given
> >>> that next_domain is a synonym of next_hsp. I think either the
> >>> synonym relationship remains and hits have multiple hsps (and there
> >>> is only one hit per model)
> > [snip]
> >
> >> The model (hit-like) table scores are retained and can be retrieved
> >> via $model->significance and the individual domain (hsp-like) evalues
> >> via $model->evalue.
> >
> > I know, see my earlier post.
> >
> >> The reason you don't get all the individual domain evalues is that
> >> only five alignments are returned by default.  You might try changing
> >> the 'A' parameter to see if you can get more alignments; that may
> >> work around the problem of missing domains for now.
> >
> > [I'm using my own data, not the OP's]
> > No, I have all the alignments: 'A' isn't a problem. And I can get all
> > the domains. The problem is I have to check multiple different hits to
> > find them all.
> >
> >
> >> You'll note that the Model/Domain results returned are not based on
> >> top score but what looks like the position of the domain in the
> >> sequence (seq-t in the last table); that's what is stated in the
> >> hmmpfam docs.
> > [...]
> >> Well, that and SearchIO is set up as a SAX-like parser, so I believe
> >> it processes the model-domain alignments as the file is parsed.
> >
> > Yes, this is the problem. The parser does the obvious thing, but in my
> > view it does not do the correct thing.
> >
> >
> >> Model/domain pairs really aren't Hits/HSPs by definition, like the
> >> CVS commit from Jason states.  The way Pfam is set up, you actually
> >> have your query(ies) scanned using a database of Pfam domains (HMM's,
> >> built from protein alignments for various protein families), hence
> >> the alignment in the report is not a HSP since HSPs come from
> >> pairwise sequence alignments.  An HSP is a pair of sequences which,
> >> when aligned, meet or exceed a maximal cutoff.  The hmmpfam report
> >> has alignments of the sequence and the consensus for the alignment
> >> the HMM is based on (not another sequence, so not an HSP).
> >
> > But this is just semantics. It doesn't /matter/ that its not really
> > truly a sequence that's being aligned. The parser needs to present to
> > the user the information in the file. As we see in the OP's
> > example, it
> > simply fails to do this because the parser isn't model-centric
> > while the
> > file it is parsing /is/.
> >
> > And in any case, your argument doesn't hold because even the current
> > parser /does/ store domains in hsp objects! It just only stores one
> > hsp
> > per hit, repeatedly, which is nonsensical.
> >
> > [to avoid confusion, in the following the use of 'model' is in the
> > programming sense, whilst 'Model' refers to the things generated by
> > hmmer]
> >
> > The correct model to describe the file being parsed is one that is
> > able
> > provide to the user all the available results for all Models that
> > hit a
> > query sequence, even when there are no alignments in the file. To make
> > this fit the SearchIO scheme, we must have one hit per Model. The hit
> > has hsps which are the domains. This perfectly matches the information
> > in the file. It matches something like a Blast, where you have one hit
> > per database sequence/query sequence combo.
> >
> > A hit could end up with no hsps (no domains), but we may not even
> > care.
> > Sometimes you really do just want to know if a particular model hit at
> > all, and with what evalue/score. The current parsing model isn't
> > guaranteed to tell you this even when you can read it yourself in the
> > file being parsed.
> >
> > You can guess at the intent of the original authors, I think, just by
> > looking at those method synonyms. next_hit == next_model. next_hsp ==
> > next_domain. This makes perfect sense. This is the way to correctly
> > model the information in the file. The problem is that next_model
> > doesn't give you the next Model (because each Model has multiple
> > hits),
> > and next_domain doesn't give you the next domain (because each hit
> > only
> > has one domain).
> >
> >
> >> I think the reasoning for keeping single model-domain pairs is that
> >> you should consider each domain's location in the sequence as well as
> >> the number of times they appear, regardless of whether they belong to
> >> the same model or not.  One protein could have three ATP-binding
> >> domains and another two, and they could be located in different
> >> positions on the sequence.  But where they are on the sequence in
> >> relation to other domains and to each other (i.e. positional
> >> information) is just as important, maybe more so, than how many times
> >> that domain appears.
> >
> > Well, that's for the user to decide. But the way the results are
> > presented needs to make sense. If blast results came back with all
> > hsps
> > listed out in sequence position order, would you have multiple hits
> > per
> > database sequence each with one hsp? No, because the meaning is
> > completely wrong. The 'hit' is the collection of alignments of a
> > particular database sequence hitting a query sequence. The alignments
> > are stored in a bunch of hsps. It is absurd to have more than one hit
> > object for a database+query sequence combo, because then we have
> > multiple hit objects duplicating the exact same information, and 'hit'
> > no longer has any meaning - it is a collection of /some/ of the
> > alignments? Yet this is exactly what we have with hmmpfam result
> > parsing.
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> --
> Jason Stajich
> Duke University
> http://www.duke.edu/~jes12
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l