[Bioperl-l] Bio::Search::HSP::GenericHSP::seq_inds

Chris Fields cjfields at uiuc.edu
Thu Mar 27 12:04:20 EDT 2008


On Mar 27, 2008, at 10:09 AM, Marc Logghe wrote:

> Hi Sendu, Chris
>
>>> At least that is what one would expect looking at the example usage
> in
>>> the HOWTOs
> (http://www.bioperl.org/wiki/HOWTO:SearchIO#Using_the_methods
>>> second code block).
>>>
>>> Am I correct in believing you can only do this if your hsp query
> stretch
>>> starts at position 1 of the query sequence?
>>
>> No...
>
> <HOWTO>
> # put all the conserved matches in query strand into an array
> my @str_array = split "",$hsp->query_string;
> foreach ( $hsp->seq_inds('query','conserved') ){
>  push @conserved,$str_array[$_ - 1];
> }
> </HOWTO>
>
> $hsp->query_string will return
> 'IAVEEETKTTKKNKKQ-QQQANKNKNKNKKK--TTIAPEAAIDANIAAEVHTQVL'
>
> In my example using the 'gap' class (instead of 'conserved'),  
> @str_array
> will contain 417, 431 and 432. The off-by-one indices do not exist in
> that array.
> Therefore, I still think the howto shows a special case where the hsp
> query sequence starts at 1 (compared to 402 in my particular example).

We'll have to look at it; it should probably be clarified particularly  
in reference to 'gaps' and use of seq positions vs. HSP (or alignment)  
positions.

Think of it this way; seq_inds() takes 'identical', 'conserved', etc.,  
all of which refer to the original positions (indices) of the sequence  
which fall into the particular category asked for.  In these cases we  
are using the coordinates for query/hit directly from the HSP info in  
the report.  This is done with the express purpose of mapping  
attributes back to the original sequence, be it the query or subject.

Gaps, however, are tricky, since sequence coordinates refer to  
residues (not gaps) when using BLAST.  In this case we use the  
sequence position prior to the gap to note where a gap is inserted.   
The previous results, then, would be wrong as there is no gap inserted  
after 432.  I just committed a fix which just repeats the position  
based on the number of gaps.

>>> Query  402  IAVEEETKTTKKNKKQ-QQQANKNKNKNKKK--TTIAPEAAIDANIAAEVHTQVL
> 453
>>>            +A+ E   TT K +KQ ++  NK  NK  KK  T+  P+AA+ + I AE  +Q L
>>> Sbjct  139  VALREALSTTFKTQKQLKENLNKQPNKQWKKSQTSTNPKAALKSKIVAEFRSQAL
> 193
>>>
>>> Now, if you call seq_inds(query => 'gap') on that particular hsp
> object,
>>> you get these positions: 417, 431, 432. Obviously, there is no gap
> in
>>> the original query sequence at these positions.
>>> How do you have to read these numbers ? Remark also that for
> instance
>>> 417 is the res just in front of the gap.
>>
>> Its purpose is to let you know the position in query or subject
>> coordinates where something interesting happened in the alignment. So
>> seq_inds(query => 'gap') is telling you all the places that a gap
> starts
>> in the alignment in terms of the query coordinates. Hence 417 etc.
>
> So, this means you have to interpret that as a gap is coming after  
> 417 ?

Yes.

>> (Actually, does 432 make sense? Shouldn't it be 431 twice?)
> Don't know, depends on how you have to 'read' this.
> Thanks for looking into this.
> Regards,
> Marc

Repeating the position based on the number of gaps is now the default  
in bioperl-live.  Just working on fixing problems with collapsing  
numbers and tests and everything should be fine.

chris


More information about the Bioperl-l mailing list