[Biopython] Help modify this code so it can do what I want it to do
Edson Ishengoma
ishengomae at nm-aist.ac.tz
Sun Feb 2 19:28:23 UTC 2014
Hi folks,
I picked this code up from somewhere and edited it a bit, but it still can't
achieve what I need. I have an XML output of tblastn hits against my customized
database and am now in the process of extracting the results with Biopython.
With tblastn, a returned hit sometimes consists of multiple local hits
corresponding to certain positions along the query with significant scores.
I want to concatenate these local hits, which first requires sorting them
according to position.
for record in records:
    for alignment in record.alignments:
        # sorting results according to positions
        hits = sorted((hsp.query_start, hsp.query_end, hsp.sbjct_start, hsp.sbjct_end,
                       alignment.title, hsp.query, hsp.sbjct)
                      for hsp in alignment.hsps)
        complete_query_seq = ''
        complete_sbjct_seq = ''
        for q_start, q_end, sb_start, sb_end, title, query, sbjct in hits:
            print title
            print 'The query starts from position: ' + str(q_start)
            print 'The query ends at position: ' + str(q_end)
            print 'The hit starts at position: ' + str(sb_start)
            print 'The hit ends at position: ' + str(sb_end)
            print 'The query is: ' + query
            print 'The hit is: ' + sbjct
            # concatenating subsequent query/subject portions with alignments
            complete_query_seq += str(query[q_start:q_end])
            complete_sbjct_seq += str(query[sb_start:sb_end])
        print 'Complete query seq is: ' + complete_query_seq
        print 'Complete subject seq is: ' + complete_sbjct_seq

This would print:
Species_1
The query starts from position: 1
The query ends at position: 184
The hit starts at position: 1
The hit ends at position: 552
The query is: ####### query_seq
The hit is: ######### hit_seq
Species_1
The query starts from position: 390
The query ends at position: 510
The hit starts at position: 549
The hit ends at position: 911
The query is: ####### query_seq
The hit is: ######### hit_seq
Species_1
The query starts from position: 492
The query ends at position: 787
The hit starts at position: 889
The hit ends at position: 1776
The query is: ####### query_seq
The hit is: ######### hit_seq
Complete query seq is: ####### query_seq
Complete subject seq is: ######### hit_seq

This is not what I want: clearly the program did no concatenation at
all, or I messed up seriously. What I want is
Complete query seq is: ####### ##############
(color coded to mark the different portions of the query with
significant hits), with no sequence overlaps. How do I achieve that?
Thanks,
Regards,
Edson.
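
One likely reason the loop above concatenates nothing after the first local
hit: hsp.query and hsp.sbjct hold only the aligned segment of each HSP, so
slicing them with whole-query coordinates such as query[390:510] returns an
empty string once the segment is shorter than q_start. A minimal sketch of one
way to join the segments without overlaps follows, assuming 1-based inclusive
query coordinates as BLAST reports them and '-' as the gap character. The
function name merge_hsps and the toy sequences are hypothetical, the code is
Python 3, and it works on plain tuples so it runs without Biopython.

```python
def merge_hsps(hsps):
    """Concatenate aligned HSP segments in query order, skipping overlaps.

    hsps: iterable of (q_start, q_end, query_aln, sbjct_aln) tuples.
    Assumption: q_start/q_end are 1-based inclusive query positions and
    query_aln/sbjct_aln are the aligned strings of equal length.
    """
    merged_query = []
    merged_sbjct = []
    covered_to = 0  # highest query position already written out
    for q_start, q_end, query_aln, sbjct_aln in sorted(hsps):
        pos = q_start - 1  # query position of the last residue seen so far
        for q_char, s_char in zip(query_aln, sbjct_aln):
            if q_char != '-':  # gap columns do not advance the query position
                pos += 1
            if pos > covered_to:  # keep only columns not covered by earlier HSPs
                merged_query.append(q_char)
                merged_sbjct.append(s_char)
        covered_to = max(covered_to, pos)
    return ''.join(merged_query), ''.join(merged_sbjct)


# Toy example: two overlapping segments given out of order.
toy_hsps = [(5, 8, "EFGH", "efgh"), (1, 6, "ABCDEF", "abcdef")]
print(merge_hsps(toy_hsps))  # ('ABCDEFGH', 'abcdefgh')
```

With Bio.Blast.NCBIXML records, the tuples could be built as
(hsp.query_start, hsp.query_end, hsp.query, hsp.sbjct) for each hsp in
alignment.hsps. Query regions that no HSP covers are simply skipped, and
subject columns are dropped together with the query columns they align to.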