[Bioperl-l] Polyproteins, ribo slippage, and mat_peptide in viruses?

Tue Oct 27 18:13:22 EDT 2009

Peter, Chris,

Thank you muchly for your expert and well-presented dialog. Yes, here
is an actual and typical problem in generating protein sequences from
viral polyproteins when the mat_peptide features carry no sequence and
no unique ID:

For the record:

http://www.ncbi.nlm.nih.gov/nuccore/112253723?report=genbank

this is the coronavirus:
LOCUS       DQ848678               29277 bp    RNA     linear   VRL 12-SEP-2006
DEFINITION  Feline coronavirus strain FCoV C1Je, complete genome.
ACCESSION   DQ848678
VERSION     DQ848678.1  GI:112253723
It contains a polyprotein 1ab in which the ribosome stalls at 12358,
backs up one nucleotide, changes frame, and then continues:
      CDS             join(311..12358,12358..20391)
                      /ribosomal_slippage
                      /codon_start=1
                      /product="polyprotein 1ab"
The genome has multiple, gigantic polyproteins. Their child peptides
are what we would actually love to focus our comparative bioinformatic
scrutiny on, but none of them have mappable IDs or seqs below the
polyprotein level, as can be seen:
      mat_peptide     15118..16914
                      /product="nsp13"
                      /note="helicase"
(these three lines are all we have to go on)
We are given no ID below the polyprotein level and no protein
sequence, just positions. They are nucleotide positions at that. We are
given the complete polyprotein sequence, so we have the components to
do this on paper, but no code. In summary, we would like to pass
GenBank files to a method and get out FASTA protein files that have
IDs and sequences.
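
As a sketch of the on-paper arithmetic (Python here purely for
illustration, not BioPerl; the function names spliced_positions and
peptide_slice are made up for this example), the mat_peptide nucleotide
coordinates can be mapped through the CDS join to amino acid offsets in
the polyprotein. The coordinates are the ones from DQ848678 above. The
assumption is that base 12358 is read twice, as the join implies.

```python
def spliced_positions(segments):
    """Genomic coordinate (1-based) of every base of the spliced CDS, in order."""
    positions = []
    for start, end in segments:
        positions.extend(range(start, end + 1))
    return positions


def peptide_slice(segments, pep_start, pep_end, codon_start=1):
    """Return (aa_from, aa_to), a 0-based half-open slice into the CDS translation."""
    positions = spliced_positions(segments)
    # The slipped base occurs twice; take the first hit for the start
    # and the last hit for the end so a peptide can span the slip site.
    first = positions.index(pep_start)
    last = len(positions) - 1 - positions[::-1].index(pep_end)
    offset = codon_start - 1
    nt_from, nt_to = first - offset, last + 1 - offset
    if nt_from % 3 or nt_to % 3:
        raise ValueError("mat_peptide is not codon-aligned with the CDS")
    return nt_from // 3, nt_to // 3


# CDS join(311..12358,12358..20391) and mat_peptide 15118..16914 (nsp13)
segments = [(311, 12358), (12358, 20391)]
aa_from, aa_to = peptide_slice(segments, 15118, 16914)
print(aa_from, aa_to)  # 4936 5535
```

Under that assumption, nsp13 would be the slice [4936:5535] of the
/translation string, i.e. 599 residues.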

You guys are forcing me (thank god) to think critically and clearly  
about it too, so let me extend the proposed module or method as best I  
can:

Input offending virus GenBank file
For each mat_peptide in a CDS:
    get nuc coords, e.g. 15118..16914
    translate it to aa
    if slippage, then stop translating at end of frame 1 and hold for later
    now translate frame 2
    $full_translation = $translation_part_1 . $translation_part_2
    compare $full_translation to the CDS translation, e.g.
        /translation="MSSKHFKILVNE..."
    if identical subsequence, then admit as the real, valid mat_peptide
    sequence
    annotate with parent unique ID + product name, e.g. 112253723_helicase
    go to next mat_peptide
Concatenate products to a FASTA amino acid file for the whole virus
isolate or CDS set for your virus
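
The steps above can be sketched roughly as follows. This is Python for
illustration only, not a BioPerl implementation; the genome, the ID
"TOY1" and the product names are invented toy data, and every function
name is hypothetical. It joins the two nucleotide parts before
translating, which should give the same result as translating each part
and concatenating as long as the parts are codon-aligned. The standard
genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nt):
    """Translate in frame 1; unknown codons become X."""
    nt = nt.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, len(nt) - 2, 3))


def peptide_nucleotides(genome, segments, pep_start, pep_end):
    """Bases of a mat_peptide (1-based, inclusive), following the CDS join.

    A peptide that spans the slip site picks up the re-read base twice.
    Caveat: a peptide starting or ending exactly on the slip base would
    need extra care; this sketch does not handle that case.
    """
    parts = []
    for seg_start, seg_end in segments:
        lo, hi = max(seg_start, pep_start), min(seg_end, pep_end)
        if lo <= hi:
            parts.append(genome[lo - 1:hi])
    return "".join(parts)


def mat_peptides_to_fasta(genome, parent_id, segments, cds_translation, peptides):
    """Keep only peptides whose translation occurs in the CDS translation."""
    records = []
    for pep_start, pep_end, product_name in peptides:
        nt = peptide_nucleotides(genome, segments, pep_start, pep_end)
        if not nt or len(nt) % 3:
            continue  # not codon-aligned, so not admitted
        protein = translate(nt).rstrip("*")
        if protein and protein in cds_translation:
            records.append(f">{parent_id}_{product_name}\n{protein}")
    return "\n".join(records) + "\n"


# Toy genome with CDS join(1..9,9..20): base 9 is read twice.
genome = "ATGGCTAAACGTGGATATAA"
segments = [(1, 9), (9, 20)]
cds_translation = "MAKTWI"
peptides = [(4, 11, "pepA"), (12, 17, "pepB")]  # pepA spans the slip site
fasta = mat_peptides_to_fasta(genome, "TOY1", segments, cds_translation, peptides)
print(fasta, end="")
```

A plain substring test can match by chance for very short peptides, so
on real data it may be safer to also check that the match sits at the
expected offset in the polyprotein.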

PHEW.

Chris L

==========
>
> I think one could use the full-length protein and run TFASTX (which  
> allows frameshifts) against the nucleotide sequence.  The output  
> will have the frameshifts designated with '/' or '\', so it would  
> then be a matter of splitting the sequence based on the midline,  
> then mapping those protein fragments back to the original sequence  
> coordinates.  Is this along the lines of what you mean?
>
> chris

Let me look into this, thank you CF. I have not used that in the past.
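
For my own notes, the splitting step Chris describes might look
something like this (a Python sketch only; the function name and the
example string are made up and are not real TFASTX output, and mapping
the fragments back to nucleotide coordinates is not attempted here):

```python
import re


def split_on_frameshifts(aligned):
    """Split an aligned protein string at frameshift marks '/' and '\\'."""
    return [frag for frag in re.split(r"[/\\]", aligned) if frag]


# Invented example string with one forward and one reverse frameshift mark.
fragments = split_on_frameshifts("MAK/TWI\\LLP")
print(fragments)  # ['MAK', 'TWI', 'LLP']
```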