[Biopython] translating 454 data with frameshifts

Jessica Grant jgrant at smith.edu
Fri Dec 10 14:59:38 UTC 2010


We have some 454 transcriptome data and, quite simply, we are trying to 
build a protein database from the nucleotide sequences.  The problem 
is that there are quite a lot of frameshifts in our contig 
assemblies, and in the original sequences as well.

We have a list of the best blastx hit for each sequence, and I have tried:

1 - blasting each sequence against its best hit
2 - taking the hsp_qseqs from the blast output
3 - sticking them together, in order, if there is more than one HSP.
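
As an illustration of step 3, here is a minimal sketch of the naive concatenation. It is plain Python and does not use the Biopython parser; the Hsp tuple and the join_hsps function are hypothetical names made up for this example, standing in for whatever the blast parser returns. It assumes all HSPs lie on the same strand, that query coordinates are given in nucleotides with start <= end, and that gap characters in the aligned query are "-".

```python
from typing import NamedTuple


class Hsp(NamedTuple):
    """Hypothetical stand-in for one parsed blastx HSP."""
    query_start: int  # first query nucleotide covered (1-based)
    query_end: int    # last query nucleotide covered
    qseq: str         # translated, aligned query segment (may contain '-')


def join_hsps(hsps):
    """Concatenate HSP query segments in query order, dropping gap characters."""
    ordered = sorted(hsps, key=lambda h: h.query_start)
    return "".join(h.qseq.replace("-", "") for h in ordered)


# Two non-overlapping HSPs, given out of order on purpose.
example = [Hsp(20, 37, "AKQ-RQI"), Hsp(1, 18, "MKTAYI")]
protein = join_hsps(example)
```

This only behaves sensibly when the HSPs do not overlap on the query, which is exactly the limitation described next.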


This has worked for many of the sequences, but sometimes the "best" 
hsp_qseqs overlap, and when I stick them together I get an 
artificially long, made-up protein.  Also, for some sequences, the 
qseq extends past the point where the alignment should stop, so when 
I stick them together I get a few extra amino acids in my protein 
that ought not to be there.
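
One possible way to deal with the overlap problem is to tile the HSPs greedily along the query and trim the overlapping part from the later one. The sketch below is only that, a sketch: Hsp and tile_hsps are hypothetical names, it is not a port of the BioPerl function, and it makes the same assumptions as before (same strand, nucleotide query coordinates with start <= end, "-" for gaps). It converts the nucleotide overlap to residues by rounding up, which is an approximation when the two HSPs are in different frames, and it does nothing about segments that run past the true end of the alignment.

```python
import math
from typing import NamedTuple


class Hsp(NamedTuple):
    """Hypothetical stand-in for one parsed blastx HSP."""
    query_start: int  # first query nucleotide covered (1-based)
    query_end: int    # last query nucleotide covered
    qseq: str         # translated, aligned query segment (may contain '-')


def tile_hsps(hsps):
    """Join HSPs in query order, trimming residues that overlap an earlier HSP."""
    # Sort by start; for equal starts, take the longer HSP first.
    ordered = sorted(hsps, key=lambda h: (h.query_start, -h.query_end))
    pieces = []
    covered_to = 0  # last query nucleotide already represented in the output
    for h in ordered:
        if h.query_end <= covered_to:
            continue  # entirely inside what we already have; skip it
        residues = h.qseq.replace("-", "")
        overlap_nt = max(0, covered_to - h.query_start + 1)
        skip = math.ceil(overlap_nt / 3)  # overlapping residues, rounded up
        pieces.append(residues[skip:])
        covered_to = h.query_end
    return "".join(pieces)


# The second HSP overlaps the first by 6 nt; the third is fully contained.
example = [Hsp(1, 18, "MKTAYI"), Hsp(13, 30, "YIAKQR"), Hsp(4, 9, "KT")]
tiled = tile_hsps(example)
```

Trimming the later HSP rather than the earlier one is an arbitrary choice here; choosing by score or e-value in the overlapping region might give better results.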

Frank Kauff told me that BioPerl has a "tile_hsp" function, but 
before I try to understand how that works in a language I am not 
familiar with, I thought I would ask here to see if anyone knows of a 
way to do this in Python.

Is there a smart way to concatenate HSPs in Biopython?  Does anyone 
have a better idea about how to build a protein database from 454 
data?

Thank you!

Jessica


