[Biopython] Help modify this code so it can do what I want it to do

Edson Ishengoma ishengomae at nm-aist.ac.tz
Sun Feb 2 19:28:23 UTC 2014


Hi folks,

I picked this code up from somewhere and edited it a bit, but it still
doesn't do what I need. I have the XML output of tblastn hits against my
custom database, and I am now trying to extract the results with Biopython.
With tblastn, a hit is sometimes returned as multiple local hits (HSPs),
each corresponding to a position along the query with a significant score.
I want to concatenate these local hits, which first requires sorting them
by position.

for record in records:
    for alignment in record.alignments:
        # Sort the local hits of this alignment by their positions.
        hits = sorted((hsp.query_start, hsp.query_end, hsp.sbjct_start, hsp.sbjct_end,
                       alignment.title, hsp.query, hsp.sbjct)
                      for hsp in alignment.hsps)
        complete_query_seq = ''
        complete_sbjct_seq = ''
        for q_start, q_end, sb_start, sb_end, title, query, sbjct in hits:
            print(title)
            print('The query starts from position: ' + str(q_start))
            print('The query ends at position: ' + str(q_end))
            print('The hit starts at position: ' + str(sb_start))
            print('The hit ends at position: ' + str(sb_end))
            print('The query is: ' + query)
            print('The hit is: ' + sbjct)
            # Concatenate subsequent query/subject portions with alignments.
            complete_query_seq += str(query[q_start:q_end])
            complete_sbjct_seq += str(query[sb_start:sb_end])
        print('Complete query seq is: ' + complete_query_seq)
        print('Complete subject seq is: ' + complete_sbjct_seq)

This would print:

    Species_1
    The query starts from position: 1
    The query ends at position: 184
    The hit starts at position: 1
    The hit ends at position: 552
    The query is: ####### query_seq
    The hit is: ######### hit_seq
    Species_1
    The query starts from position: 390
    The query ends at position: 510
    The hit starts at position: 549
    The hit ends at position: 911
    The query is: ####### query_seq
    The hit is: ######### hit_seq
    Species_1
    The query starts from position: 492
    The query ends at position: 787
    The hit starts at position: 889
    The hit ends at position: 1776
    The query is: ####### query_seq
    The hit is: ######### hit_seq
    Complete query seq is: ####### query_seq
    Complete subject seq is: ######### hit_seq

This is not what I want: clearly the program did no concatenation at
all, or I messed up seriously. What I want is Complete query seq is: #######
############## (color coded to show the different portions of the query with
significant hits), with no sequence overlaps. How do I achieve that?
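
To make the goal concrete, here is a minimal sketch of the sort-then-concatenate step with overlaps trimmed. It is not a tested solution for real tblastn output. The `HSP` namedtuple and the function `merge_hsps` are hypothetical stand-ins introduced only for illustration; the attribute names (`query_start`, `query_end`, `query`, `sbjct`) are the ones used in the code above. It assumes that `hsp.query` and `hsp.sbjct` hold only the aligned fragments (so they are walked column by column rather than sliced with query coordinates) and that query positions are 1-based residue positions.

```python
from collections import namedtuple

# Hypothetical stand-in for a Biopython HSP object; only the attributes
# needed for merging are included.
HSP = namedtuple("HSP", "query_start query_end query sbjct")


def merge_hsps(hsps):
    """Concatenate aligned query/subject strings in query order, skipping
    alignment columns whose query position was already covered."""
    merged_query, merged_sbjct = [], []
    covered = 0  # highest 1-based query position emitted so far
    for hsp in sorted(hsps, key=lambda h: (h.query_start, h.query_end)):
        pos = hsp.query_start - 1
        for q_char, s_char in zip(hsp.query, hsp.sbjct):
            if q_char != "-":
                pos += 1  # gap columns do not consume a query residue
            if pos > covered:
                # Keep the column in both strings so they stay aligned.
                merged_query.append(q_char)
                merged_sbjct.append(s_char)
        covered = max(covered, pos)
    return "".join(merged_query), "".join(merged_sbjct)


# Toy data: three local hits given out of order, the last two overlapping
# on query positions 8-9.
hsps = [
    HSP(5, 9, "EFGHI", "EFGHV"),
    HSP(1, 4, "ABCD", "ABCD"),
    HSP(8, 12, "HIJKL", "HVJ-L"),
]
merged_query, merged_sbjct = merge_hsps(hsps)
print("Complete query seq is: " + merged_query)
print("Complete subject seq is: " + merged_sbjct)
```

With real results, `merge_hsps(alignment.hsps)` could presumably be called inside the loop over `record.alignments`, since the HSP objects there carry the same four attributes. Gap characters are kept so that the two strings stay aligned; `.replace('-', '')` would give ungapped sequences.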

Thanks,

Regards,

Edson.



