[BioPython] HSPs in Blast parser

Brad Chapman chapmanb at uga.edu
Wed May 5 14:56:19 EDT 2004


Hi Jawad;
Sorry for the delay in getting back with you.

> I am stuck on parsing a BlastN output and would appreciate some help. 
> I am working on multiple HSPs for a single hit . For example if there 
> are two hsps found for one hit, I need to find where query and subject 
> ends for one hsp and then compare it with the query and subject start 
> for the next hsp
> 
> I have noticed that in the blast parser one can iterate through each 
> hsp for every single hit, but am not too sure how to treat two hsps of 
> a single hit as related and iterate through the two hsps of a single hit 
> in order to find the query (and subject) end of one and query (and subject) 
> start of the other.

You just need to iterate through each alignment and then through the
HSPs in each alignment. Then you can collect up the start and end
coordinates of each HSP and associate them with the overall
alignment. So, step by step:

1. Iterate through each alignment.
2. Create lists of HSP information for queries and subjects
  2a. Iterate through each HSP
  2b. Get the start and end coordinates from the hsp.query_start and
      hsp.sbjct_start attributes of the hsps
  2c. Calculate the HSP ends using the query and subject
      information, removing gaps added by BLAST
  2d. Add each start, end to the lists
3. You have the HSP lists, associated with the alignment title.

Here is some code which demonstrates this:

from Bio.Blast import NCBIStandalone

parser = NCBIStandalone.BlastParser()
rec = parser.parse(open("blast.out"))

for align in rec.alignments:
    query_hsps = []
    sbjct_hsps = []
    for hsp in align.hsps:
        # adjust query starts to python (0-based) coordinates
        query_start = hsp.query_start - 1
        # get the end from the length minus gaps
        query_end = query_start + len(hsp.query.replace("-", ""))
        sbjct_start = hsp.sbjct_start - 1
        sbjct_end = sbjct_start + len(hsp.sbjct.replace("-", ""))
        query_hsps.append((query_start, query_end))
        sbjct_hsps.append((sbjct_start, sbjct_end))
    print "Hit", align.title
    print "Query HSPs", query_hsps
    print "Subject HSPs", sbjct_hsps

Hope this helps!
Brad


More information about the BioPython mailing list