[Biopython] translating 454 data with frameshifts

Sun Dec 12 21:57:15 UTC 2010

Hi Jessica

* There are some packages out there which combine blast and "de-novo" HMM (e.g. ESTScan) evidence to 
  do translations of transcript contigs - prot4EST is a python based one (blast + ESTScan)

  I think there have been others published.

* I have also combined blastx and ESTScan evidence, as follows : 

   1. blastx the contigs against the NR protein database, recording the top (say) 10 hits
       (*not* using the -w option - see below)

   2. For those sequences where all HSPs are in the same frame, conclude that there are no 
       frameshift errors, and translate by picking the longest ORF that is in the same 
       frame as, and overlaps, the HSPs.

   3. For those sequences with HSPs in more than one frame, conclude that there are frameshift
       errors and use ESTScan, which includes frameshifts in its model

   4. Confirm the translations via annotation, using blastp against NR

   (I have some python code for bits of this, which I am happy to share if useful)

   (This is no use for unknowns, obviously - the only option for those is something like ESTScan)
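
A minimal sketch of steps 2 and 3 above, in plain Python. It is not the code mentioned above. The function names, the 0-based half-open HSP coordinates and the restriction to forward frames (1, 2, 3) are assumptions made for illustration. "ORF" is taken here to mean a stop-free stretch in the given frame, without requiring a start codon.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated with bases in T, C, A, G order.
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), _AMINO)}


def translate(nuc):
    """Translate codon by codon; codons with ambiguous bases become 'X'."""
    nuc = nuc.upper()
    return "".join(CODON_TABLE.get(nuc[i:i + 3], "X")
                   for i in range(0, len(nuc) - 2, 3))


def has_frameshift(hsp_frames):
    """Step 3 test: HSPs in more than one frame suggest a frameshift error."""
    return len(set(hsp_frames)) > 1


def longest_orf_overlapping(seq, frame, hsp_start, hsp_end):
    """Step 2: longest stop-free stretch in `frame` that overlaps the HSPs.

    Assumptions of this sketch: frame is 1, 2 or 3 (forward strand only) and
    hsp_start/hsp_end are 0-based, half-open nucleotide coordinates.
    """
    protein = translate(seq[frame - 1:])
    best = ""
    pos = 0  # index of the current segment's first residue within `protein`
    for segment in protein.split("*"):
        nt_start = frame - 1 + 3 * pos
        nt_end = nt_start + 3 * len(segment)
        if nt_start < hsp_end and hsp_start < nt_end and len(segment) > len(best):
            best = segment
        pos += len(segment) + 1  # +1 skips the stop codon
    return best


if __name__ == "__main__":
    contig = "ATGGCCTAAATGAAATTTGGGTGA"
    if not has_frameshift([1, 1]):
        print(longest_orf_overlapping(contig, 1, 10, 20))
```

Reverse-strand frames could be handled by reverse-complementing the contig first and converting the HSP coordinates accordingly.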

* Have you tried using the -w option of blastx? (Frame shift penalty (OOF algorithm for blastx)) - 
  blastx may be able to figure out the frameshift errors for you and generate a single 
  merged alignment, using this option. We have had fairly good luck with -w 20. To
  reduce the chance of alignments with spurious frameshifts, you could restrict
  blastx -w to step 3 above, as an alternative to ESTScan - i.e. to the sequences where you
  already know there are frameshifts - or you could use it to check the ESTScan predictions.
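
As a sketch of how such a run might be set up from Python, the helper below only assembles the argument list for the legacy blastall program. The helper name, the file names and the choice of XML output (-m 7) are assumptions for illustration. The -w value is the one mentioned above.

```python
def blastx_oof_command(query_fasta, database, frameshift_penalty=20,
                       out_xml="blastx_oof.xml"):
    """Build a legacy blastall command line for out-of-frame blastx.

    The file names and the XML output choice are hypothetical defaults.
    """
    return [
        "blastall", "-p", "blastx",
        "-i", query_fasta,               # contigs with suspected frameshifts
        "-d", database,                  # protein database, e.g. nr
        "-w", str(frameshift_penalty),   # frame shift penalty (OOF algorithm)
        "-m", "7",                       # XML output
        "-o", out_xml,
    ]


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch has no side effects.
    print(" ".join(blastx_oof_command("frameshifted_contigs.fasta", "nr")))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where blastall and the database are installed.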

Cheers

AMcC

-----Original Message-----
From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Jessica Grant
Sent: Saturday, 11 December 2010 4:00 a.m.
To: biopython at biopython.org
Subject: [Biopython] translating 454 data with frameshifts

We have some transcriptome 454 data and, quite simply, we are trying to 
build a protein database from the nucleotide sequences.  The problem 
is that there are quite a lot of frameshifts in our contig 
assemblies--and in the original sequences as well.

We have a list of the best blastx hit for each sequence, and I have tried

1 - blasting each sequence against its best hit
2 - taking the hsp_qseqs from the blast output
3 - sticking them together, in order, if there is more than one hsp.

This has worked for many of the sequences but sometimes there are 
overlapping "best hsp_qseqs" and when I stick them together I get a 
long made-up protein.  Also, for some sequences, the qseq goes past 
the point where the alignment should stop and then when I stick them 
together I get a few extra amino acids in my protein that ought not 
to be there.

Frank Kauff told me that bioperl has a "tile_hsp" function, but 
before I try understanding how that works in a language I am not 
familiar with, I thought I would ask here to see if anyone knows of a 
way to do this in python.

Is there a smart way to concatenate hsps in biopython?  Does anyone 
have a better idea about how to build a protein database from 454 
data?
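
One possible way to avoid the overlapping-HSP problem described above is to keep only a non-overlapping subset of HSPs before joining them. The sketch below is a simple greedy selection by score. It is not BioPerl's tile_hsp algorithm and not a Biopython function. The Hsp tuple and its 0-based, half-open query coordinates are assumptions for illustration; with Biopython's NCBIXML parser the values would come from each HSP's query_start, query_end and score attributes, after converting the coordinates.

```python
from collections import namedtuple

# Hypothetical minimal HSP record: query coordinates are 0-based, half-open.
Hsp = namedtuple("Hsp", "q_start q_end score")


def select_non_overlapping(hsps):
    """Greedily keep the best-scoring HSPs that do not overlap on the query.

    Returns the kept HSPs ordered by query start, ready to be joined in order.
    """
    kept = []
    for hsp in sorted(hsps, key=lambda h: h.score, reverse=True):
        # Keep this HSP only if it lies entirely before or after every kept one.
        if all(hsp.q_end <= k.q_start or hsp.q_start >= k.q_end for k in kept):
            kept.append(hsp)
    return sorted(kept, key=lambda h: h.q_start)


if __name__ == "__main__":
    hsps = [Hsp(0, 90, 120.0), Hsp(60, 150, 80.0), Hsp(150, 240, 95.0)]
    for hsp in select_non_overlapping(hsps):
        print(hsp)
```

Dropping a lower-scoring overlapping HSP loses its non-overlapping part; trimming the overlap instead would keep more sequence at the cost of more bookkeeping.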

Thank you!

Jessica