[Biopython] translating 454 data with frameshifts
McCulloch, Alan
alan.mcculloch at agresearch.co.nz
Sun Dec 12 21:57:15 UTC 2010
Hi Jessica
* There are some packages out there which combine blast and "de novo" HMM evidence (e.g. ESTScan) to
translate transcript contigs - prot4EST is one such pipeline (blast + ESTScan).
I think others have been published as well.
* I have also combined blastx and ESTScan evidence, as follows:
1. blastx the contigs against NR protein, recording the top (say) 10 hits (*not* using the -w option - see below).
2. For those sequences where all HSPs are in the same frame, conclude that there are no
frameshift errors, and translate the longest ORF that is in the same
frame as, and overlaps, the HSPs.
3. For those sequences with HSPs in more than one frame, conclude that there are frameshift errors and
use ESTScan, whose model allows for them.
4. Confirm the translations by annotating them with blastp against NR.
(I have some Python code for bits of this, which I am happy to share if useful.)
(This is no use for unknowns, obviously - the only option for those is something like ESTScan.)
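
Steps 2 and 3 above can be sketched in plain Python. This is only an illustration, not the code mentioned above: the function names are made up, HSP frames are assumed to have been parsed already from the blastx output, only forward frames 1-3 are handled, and an "ORF" is taken to mean a stop-free stretch (contigs are often partial, so no start codon is required).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA codon by codon; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def classify_contig(hsp_frames):
    """Decide how to treat a contig from the frames of its blastx HSPs."""
    frames = set(hsp_frames)
    if not frames:
        return "no_hit"                # unknowns: ESTScan only
    if len(frames) == 1:
        return "single_frame"          # step 2: longest ORF in that frame
    return "frameshift_suspected"      # step 3: hand over to ESTScan


def stop_free_segments(dna, frame):
    """Yield (start, end) nucleotide spans (0-based, end-exclusive)
    of stop-free stretches in forward frame 1, 2 or 3."""
    offset = frame - 1
    protein = translate(dna[offset:])
    aa_start = 0
    # A trailing '*' closes the final segment.
    for i, aa in enumerate(protein + "*"):
        if aa == "*":
            if i > aa_start:
                yield offset + 3 * aa_start, offset + 3 * i
            aa_start = i + 1


def longest_orf_overlapping(dna, frame, hsp_start, hsp_end):
    """Translate the longest stop-free stretch in `frame` that overlaps
    the HSP span [hsp_start, hsp_end) on the query."""
    best = None
    for start, end in stop_free_segments(dna, frame):
        if start < hsp_end and end > hsp_start:
            if best is None or end - start > best[1] - best[0]:
                best = (start, end)
    return translate(dna[best[0]:best[1]]) if best else ""


contig = "ATGGCCTAAATGAAACCCGGGTTT"
if classify_contig([1, 1]) == "single_frame":
    print(longest_orf_overlapping(contig, 1, 12, 20))
```

Reverse-strand frames could be handled the same way by reverse-complementing the contig and converting the HSP coordinates first.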
* Have you tried using the -w option of blastx? (Frame shift penalty (OOF algorithm for blastx)) -
with this option, blastx may be able to figure out the frameshift errors for you and generate a single
merged alignment. We have had fairly good luck with -w 20. In order to
reduce the chances of alignments with spurious frameshifts, you could try using
blastx -w only in step 3 above, as an alternative to ESTScan - i.e. only where you
already know there are frameshifts - or you could use it to check the ESTScan predictions.
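
For illustration, here is one way such a run might be assembled from Python. It assumes the legacy NCBI "blastall" program (where -w is the frame shift penalty); the file names, database name and helper function are hypothetical, and the command is only built and printed, not executed.

```python
import shlex


def blastx_oof_command(query_fasta, database, out_xml, frameshift_penalty=20):
    """Build a legacy blastall command line for blastx with the
    out-of-frame (OOF) frame shift penalty switched on."""
    return [
        "blastall",
        "-p", "blastx",                  # translated query vs protein db
        "-i", query_fasta,               # input contigs (hypothetical path)
        "-d", database,                  # protein database, e.g. nr
        "-w", str(frameshift_penalty),   # frame shift penalty
        "-m", "7",                       # XML output
        "-o", out_xml,                   # output file (hypothetical path)
    ]


cmd = blastx_oof_command("frameshift_contigs.fasta", "nr", "frameshift_contigs.xml")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list could be passed to subprocess.run, and the XML output read with Bio.Blast.NCBIXML.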
Cheers
AMcC
-----Original Message-----
From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Jessica Grant
Sent: Saturday, 11 December 2010 4:00 a.m.
To: biopython at biopython.org
Subject: [Biopython] translating 454 data with frameshifts
We have some transcriptome 454 data and quite simply we are trying to
build a protein database from the nucleotide sequences. The problem
comes in that there are quite a lot of frameshifts in our contig
assemblies--and in the original sequences as well.
We have a list of the best blastx hit for each sequence, and I have tried
1 - blasting each sequence against its best hit
2 - taking the hsp_qseqs from the blast output
3 - sticking them together, in order, if there is more than one hsp.
This has worked for many of the sequences but sometimes there are
overlapping "best hsp_qseqs" and when I stick them together I get a
long made-up protein. Also, for some sequences, the qseq goes past
the point where the alignment should stop and then when I stick them
together I get a few extra amino acids in my protein that ought not
to be there.
Frank Kauff told me that bioperl has a "tile_hsp" function, but
before I try understanding how that works in a language I am not
familiar with, I thought I would ask here to see if anyone knows of a
way to do this in python.
Is there a smart way to concatenate hsps in biopython? Does anyone
have a better idea about how to build a protein database from 454
data?
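
A minimal sketch of the "stick them together" step that trims overlapping HSPs rather than concatenating them blindly. This is not BioPerl's tile_hsp and not a Biopython function; the name tile_hsps and the input format are made up for illustration. Each HSP is assumed to be given as (query_start, query_end, protein), with 0-based, end-exclusive nucleotide coordinates on the forward strand and the translated query segment already stripped of gap characters.

```python
import math


def tile_hsps(hsps):
    """Join HSP query translations in query order, trimming residues
    that overlap a region already covered by an earlier HSP."""
    pieces = []
    covered_to = None  # nucleotide position up to which the query is covered
    for q_start, q_end, protein in sorted(hsps):
        if covered_to is not None and q_end <= covered_to:
            continue  # lies entirely inside what we already have
        if covered_to is not None and q_start < covered_to:
            overlap_nt = covered_to - q_start
            # Drop every residue whose codon touches the overlap.
            protein = protein[math.ceil(overlap_nt / 3):]
        pieces.append(protein)
        covered_to = q_end
    return "".join(pieces)


hsps = [
    (9, 21, "VAGT"),   # overlaps the first HSP by one codon
    (0, 12, "MKLV"),
    (3, 9, "KL"),      # contained in the first HSP, so skipped
]
print(tile_hsps(hsps))
```

Trimming by query coordinates avoids duplicated residues, but it does not decide which of two overlapping HSPs is the better one in the overlap; that would need the alignment scores.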
Thank you!
Jessica