[Bioperl-l] Fwd: [blast-help] Fwd: Question about the definition of 'gaps' in blast -m8 output...

Dan Bolser dan.bolser at gmail.com
Fri Mar 20 17:13:59 UTC 2009


Here is what the man from the NCBI said:


---------- Forwarded message ----------
From: Peter Cooper <cooper at ncbi.nlm.nih.gov>
Date: 2009/3/20
Subject: Re: [blast-help] Fwd: Question about the definition of 'gaps'
in blast -m8 output...
To: dan.bolser at gmail.com
Cc: blast-help at ncbi.nlm.nih.gov


Hello,

The number reported tin the -m  8 output is the number of gap
openings. This will only equal the number of gap characters if the
length of each gap is 1.


Peter
-------------------------------
Peter S. Cooper, Ph.D.
Public Services
The National Center for Biotechnology Information
301-435-5951






On Mar 19, 2009, at 12:04 PM, romiti wrote:

>
>
> Begin forwarded message:
>
>> From: User Services Service Account <info at ncbi.nlm.nih.gov>
>> Date: March 19, 2009 10:23:01 AM EDT
>> To: romiti at ncbi.nlm.nih.gov
>> Subject: Question about the definition of 'gaps' in blast -m8 output...
>> Reply-To: User Services Service Account <info at ncbi.nlm.nih.gov>
>>
>>
>> ------------- Begin Forwarded Message -------------
>>
>> Date: Wed, 18 Mar 2009 19:31:01 +0000
>> Subject: Question about the definition of 'gaps' in blast -m8 output...
>> From: Dan Bolser <dan.bolser at gmail.com>
>> To: bbb at bioinformatics.org, bioperl-l at lists.open-bio.org, info at ncbi.nlm.nih.gov
>>
>>
>> Hi,
>>
>> I'm sure this question comes up again and again, but searching the BioPerl
>> mailing list didn't turn up any answers (to the second question). Basically
>> I want to manually merge HSP's into 'contigious hits', and I want to look at
>> the effect of various parameters on an algorithm to do this. This task
>> prompted me to run a 'sanity check' on the blast data that I had, and I
>> found that this check fails to fulfil my expectation of the data. This means
>> that either I don't understand the data or the results are buggy.
>>
>> Can someone clarify the definition of the 'gaps' column in the blast -m8
>> output format for me?
>>
>> I thought that the column 'gaps' was basically the number of columns in the
>> HSP that contains a gap character. To test this on my data, I checked the
>> following equality:
>>
>> GAPS + 2 =
>> LENGTH - abs(QUERY_END - QUERY_START) + LENGTH - abs(HIT_END - HIT_START)
>>
>>
>> This says that the number of GAPS should be equal to the difference between
>> the LENGTH of the alignment minus the distance between the START and END
>> point on either the QUERY or the HIT (+2 for the 'off by one' error
>> introduced by the two END-START calculations).
>>
>> e.g.
>>
>> 10-> MMMMMMMM**MMMM*M <-22
>>    ||||  ||  | |  |
>> 20-> MMMM**MMMMM*M*MM <-31
>>
>>
>> where MISMATCHES = 7, LENGTH = 16, QUERY_END - QUERY_START = 12, and HIT_END
>> - HIT_START = 11. The formula gives:
>>
>> 7+2(9) = 16-12(4) + 16-11(5)
>>
>>
>> The formula is correct for 11,282 out of 12,745 HSPs in my dataset (89%),
>> however it fails for 1,463 cases (11%). Each of these cases has a value of
>> MISMATCHES smaller than calculated by the formula. The difference is usually
>> 1 or 2, but is seen to go as high as 96, and scales roughly linearly with
>> the size of GAPS.
>>
>>
>> Did I misunderstand what the value of GAPS is supposed to mean? How come it
>> does apparently mean what I thought for so much of the data?
>>
>>
>> Thanks very much for any help on the above.
>>
>> Dan.
>>
>> ------------- End Forwarded Message -------------



More information about the Bioperl-l mailing list