[Bioperl-l] Question about the definition of 'gaps' in blast -m8 output...

Dan Bolser dan.bolser at gmail.com
Wed Mar 25 14:07:12 UTC 2009


2009/3/24 Phillip San Miguel <pmiguel at purdue.edu>:
> Dan Bolser wrote:
>>
>> 2009/3/19 Phillip San Miguel <pmiguel at purdue.edu>:
>>
>>>
>>> Dan Bolser wrote:
>>>
>>>>
>>>> 2009/3/18 Phillip San Miguel <pmiguel at purdue.edu>
>>>>
>>>>
>>>>
>>>>>
>>>>> Dan Bolser wrote:
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Can someone clarify the definition of the 'gaps' column in the blast
>>>>>> -m8
>>>>>> output format for me?
>>>>>>
>>>>>> I thought that the column 'gaps' was basically the number of columns
>>>>>> in
>>>>>> the
>>>>>> HSP that contains a gap character.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> Hi Dan,
>>>>> "gaps", to me, denotes the number of gaps. Not the total length of all
>>>>> the
>>>>> gaps.
>>>>> Just my interpretation, but given your results my guess is that
>>>>> whomever
>>>>> wrote blastall was thinking the way I do.
>>>>>
>>>>>
>>>>
>>>> Yeah, I'll have to go look at the HSPs to confirm this... I'm just
>>>> surprised
>>>> that there are not more gaps of length >1. i.e. my data (given your
>>>> interpretation) suggests that 90% of the HSPs have no gaps > length 1.
>>>>
>>>>
>>>
>>> Sounds about right. Depends on how you have gap opening vs gap
>>> lengthening
>>> parameters set.
>>>
>>
>> I see. I thought that by default extension was less than opening, so I
>> had expected there to be more gaps of length >1 ... anyway... where
>> can I read more about selecting parameters for certain tasks?
>> Currently I'm blasting tomato against potato sequence, and the two
>> organisms are known to be 'highly syntenic' - I'm just not sure how
>> that translates into how I should set the parameters. I'm after large
>> alignments of large regions of the chromosome. My thinking is to just
>> run through the list of HSPs and merge based on gap / window size
>> (dynamic programming style) - that way I can play with the set of HSPs
>> that I have, and look at the effect of different settings, then I can
>> just globally align the matching regions using SW (if I need to). Does
>> that sound reasonable, or is using the default settings just dumb?
>>
>> Cheers,
>> Dan.
>>
>>
>
> Hi Dan,
>   Sorry, I didn't mean to suggest that the only reason you were seeing a
> preponderance of single base indels was due to your settings. I do expect
> single base indels to outnumber longer indels.

OK. I didn't expect that, but its good to know.


>   Nevertheless, I always thought standard alignment tools should not use a
> linear gap extension penalty. That past some point, extending a gap should
> be further "discounted". Maybe the gap extension penalty should be the log
> of the number of bases extended?
>   BTW, I just noticed that the blastall '-m 9' parameter, includes the
> column headers. They are:
>
> # Fields: Query id, Subject id, % identity, alignment length, mismatches,
> gap openings, q. start, q. end, s. start, s. end, e-value, bit score

Good find!

It would be great to update some of the 'downstream' docs with that
column name :D


Thanks again,
Dan.

> So, the column in question is "gap openings".
> Phillip
>




More information about the Bioperl-l mailing list