[Bioperl-l] Bio::Search::HSP::GenericHSP::seq_inds
Chris Fields
cjfields at uiuc.edu
Thu Mar 27 12:04:20 EDT 2008
On Mar 27, 2008, at 10:09 AM, Marc Logghe wrote:
> Hi Sendu, Chris
>
>>> At least that is what one would expect looking at the example usage
> in
>>> the HOWTOs
> (http://www.bioperl.org/wiki/HOWTO:SearchIO#Using_the_methods
>>> second code block).
>>>
>>> Am I correct in believing you can only do this if your hsp query
> stretch
>>> starts at position 1 of the query sequence?
>>
>> No...
>
> <HOWTO>
> # put all the conserved matches in query strand into an array
> my @str_array = split "",$hsp->query_string;
> foreach ( $hsp->seq_inds('query','conserved') ){
> push @conserved,$str_array[$_ - 1];
> }
> </HOWTO>
>
> $hsp->query_string will return
> 'IAVEEETKTTKKNKKQ-QQQANKNKNKNKKK--TTIAPEAAIDANIAAEVHTQVL'
>
> In my example using the 'gap' class (instead of 'conserved'),
> @str_array
> will contain 417, 431 and 432. The off-by-one indices do not exist in
> that array.
> Therefore, I still think the howto shows a special case where the hsp
> query sequence starts at 1 (compared to 402 in my particular example).
We'll have to look at it; it should probably be clarified particularly
in reference to 'gaps' and use of seq positions vs. HSP (or alignment)
positions.
Think of it this way; seq_inds() takes 'identical', 'conserved', etc.,
all of which refer to the original positions (indices) of the sequence
which fall into the particular category asked for. In these cases we
are using the coordinates for query/hit directly from the HSP info in
the report. This is done with the express purpose of mapping
attributes back to the original sequence, be it the query or subject.
Gaps, however, are tricky, since sequence coordinates refer to
residues (not gaps) when using BLAST. In this case we use the
sequence position prior to the gap to note where a gap is inserted.
The previous results, then, would be wrong as there is no gap inserted
after 432. I just committed a fix which just repeats the position
based on the number of gaps.
>>> Query 402 IAVEEETKTTKKNKKQ-QQQANKNKNKNKKK--TTIAPEAAIDANIAAEVHTQVL
> 453
>>> +A+ E TT K +KQ ++ NK NK KK T+ P+AA+ + I AE +Q L
>>> Sbjct 139 VALREALSTTFKTQKQLKENLNKQPNKQWKKSQTSTNPKAALKSKIVAEFRSQAL
> 193
>>>
>>> Now, if you call seq_inds(query => 'gap') on that particular hsp
> object,
>>> you get these positions: 417, 431, 432. Obviously, there is no gap
> in
>>> the original query sequence at these positions.
>>> How do you have to read these numbers ? Remark also that for
> instance
>>> 417 is the res just in front of the gap.
>>
>> Its purpose is to let you know the position in query or subject
>> coordinates where something interesting happened in the alignment. So
>> seq_inds(query => 'gap') is telling you all the places that a gap
> starts
>> in the alignment in terms of the query coordinates. Hence 417 etc.
>
> So, this means you have to interpret that as a gap is coming after
> 417 ?
Yes.
>> (Actually, does 432 make sense? Shouldn't it be 431 twice?)
> Don't know, depends on how you have to 'read' this.
> Thanks for looking into this.
> Regards,
> Marc
Repeating the position based on the number of gaps is now the default
in bioperl-live. Just working on fixing problems with collapsing
numbers and tests and everything should be fine.
chris
More information about the Bioperl-l
mailing list