[Bioperl-l] Polyproteins, ribo slippage, and mat_peptide in viruses?

Tue Oct 27 23:39:56 UTC 2009

It appears that acc/gi is not required for mat_peptide. Your solution
should work fine.

http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objmgr/util/create_defline.cpp#L870

Bill

> Peter, Chris,
>
> Thank you muchly for your expert and well-presented dialog.  Yes, here
> is an actual and typical problem in generating protein seqs from viral
> polyproteins, in the absence of a mat_peptide seq and unique ID:
>
> For the record:
>
> http://www.ncbi.nlm.nih.gov/nuccore/112253723?report=genbank
>
> this is the coronavirus:
> LOCUS       DQ848678               29277 bp    RNA     linear   VRL 12-SEP-2006
> DEFINITION  Feline coronavirus strain FCoV C1Je, complete genome.
> ACCESSION   DQ848678
> VERSION     DQ848678.1  GI:112253723
> Containing a polyprotein 1ab, where the ribosome stalls at 12358,
> backs up one nuc, changes frame, and then continues:
>       CDS             join(311..12358,12358..20391)
>                       /ribosomal_slippage
>                       /codon_start=1
>                       /product="polyprotein 1ab"
> and which has multiple, gigantic polyproteins, whose child peptides
> are what we would actually love to focus our comparative
> bioinformatic scrutiny on, but none of which have mappable IDs or
> seqs below the polyprotein level, as can be seen:
>       mat_peptide     15118..16914   <===
>                       /product="nsp13"   <===
>                       /note="helicase"   <=== these are all we have to go on
> and where we are given no ID below the polyprotein level and no
> protein sequence, just positions. They are nuc positions at that. We
> are given the complete polyprotein seq and have the components to do
> this on paper, but no code.  In summary, we would like to dump GenBank
> files into a method and get out FASTA protein files which have some
> IDs and seqs.
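
To make the "on paper" step concrete, here is a minimal sketch in plain
Python (not BioPerl) of mapping the quoted mat_peptide nucleotide
positions onto residue positions in the polyprotein /translation. The
helper names parse_join and cds_offset are hypothetical, made up for
illustration. It assumes forward-strand, 1-based inclusive coordinates
and that the mat_peptide lies entirely within one segment of the join;
position 12358, which is read twice, is resolved to the first segment.

```python
def parse_join(location):
    """Parse 'join(a..b,c..d)' or 'a..b' into 1-based inclusive (start, end) pairs."""
    inner = location.strip()
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[5:-1]
    return [tuple(int(x) for x in seg.split("..")) for seg in inner.split(",")]


def cds_offset(segments, pos):
    """Return the 0-based offset of genome position `pos` in the spliced CDS."""
    offset = 0
    for start, end in segments:
        if start <= pos <= end:
            return offset + (pos - start)
        offset += end - start + 1  # nucleotides contributed by this segment
    raise ValueError(f"position {pos} is outside the CDS")


segments = parse_join("join(311..12358,12358..20391)")
first = cds_offset(segments, 15118)  # first base of the mat_peptide
last = cds_offset(segments, 16914)   # last base of the mat_peptide

aa_start = first // 3      # 0-based index of the first residue
aa_end = (last + 1) // 3   # exclusive end index

# Frame check, then 1-based first and last residue numbers.
print(first % 3, aa_start + 1, aa_end)  # 0 4937 5535
```

With the coordinates quoted above, the mat_peptide is in frame with the
CDS and would correspond to cds_translation[aa_start:aa_end], i.e. 599
residues, provided the record's /translation follows the same join.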
>
> You guys are forcing me (thank god) to think critically and clearly
> about it too, so let me extend the proposed module or method as best I
> can:
>
> Input offending virus GenBank file
> For each mat_peptide in a CDS:
>   get nuc coords, e.g. 15118..16914
>   translate it to aa
>   if slippage, then stop translating at end of frame 1 and hold for later
>   now translate frame 2
>   $full_translation = $translation_part_1 . $translation_part_2
>   compare $full_translation to CDS translation, e.g.
>     /translation="MSSKHFKILVNE..."
>   if identical subsequence, then admit as the real, valid mat_peptide
>     sequence
>   annotate with parent unique ID + product name, e.g. 112253723_helicase
>   go to next mat_peptide
> Concatenate products to a FASTA amino acid file for the whole virus
> isolate or CDS set for your virus
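
The steps above can be sketched roughly as follows. This is plain Python
on a made-up 17-nt toy genome, not BioPerl and not a parser for real
GenBank files: the function names, the toy sequence, the peptide names
and the "toy1" ID are all assumptions for illustration. It uses the
standard genetic code and forward-strand, 1-based inclusive coordinates,
and it assumes a mat_peptide that spans the slippage site carries its own
join, which is spliced the same way as the parent CDS.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def splice(genome, segments):
    """Concatenate 1-based inclusive segments; an overlapping base is read twice."""
    return "".join(genome[start - 1:end] for start, end in segments)


def translate(nuc):
    """Translate in frame 1; unknown codons become 'X', a trailing partial codon is dropped."""
    nuc = nuc.upper()
    return "".join(CODON_TABLE.get(nuc[i:i + 3], "X")
                   for i in range(0, len(nuc) - 2, 3))


def mat_peptides_to_fasta(genome, parent_id, cds_translation, peptides):
    """Keep only peptides whose translation is a subsequence of the CDS translation."""
    records = []
    for product_name, segments in peptides:
        pep = translate(splice(genome, segments))
        if pep and pep in cds_translation:
            records.append(f">{parent_id}_{product_name}\n{pep}")
    return "\n".join(records)


# Toy example: the base at position 7 is read twice (a -1 slip).
genome = "ATGAAAGCCCTGGTTAA"
cds_segments = [(1, 7), (7, 17)]
cds_translation = translate(splice(genome, cds_segments)).rstrip("*")

peptides = [
    ("pepA", [(1, 6)]),            # entirely before the slip
    ("pepB", [(4, 7), (7, 11)]),   # spans the slip, so it has its own join
    ("pepC", [(9, 14)]),           # entirely after the slip
    ("bogus", [(2, 7)]),           # out of frame, should be rejected
]
fasta = mat_peptides_to_fasta(genome, "toy1", cds_translation, peptides)
print(fasta)
```

The subsequence test is what rejects the out-of-frame "bogus" entry. On
a real record the parent ID would be something like the GI, giving
headers of the form 112253723_helicase as proposed above.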
>
> PHEW.
>
> Chris L
>
> ==========
>>
>> I think one could use the full-length protein and run TFASTX (which
>> allows frameshifts) against the nucleotide sequence.  The output
>> will have the frameshifts designated with '/' or '\', so it would
>> then be a matter of splitting the sequence based on the midline,
>> then mapping those protein fragments back to the original sequence
>> coordinates.  Is this along the lines of what you mean?
>>
>> chris
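
As a rough illustration of the splitting step described above, the
sketch below cuts an aligned translated sequence at '/' and '\' marks.
It is plain Python on made-up strings; it does not parse real TFASTX
output, and the function name split_on_frameshifts is hypothetical. It
assumes '-' is the alignment gap character and leaves the mapping back to
nucleotide coordinates aside.

```python
import re


def split_on_frameshifts(aligned_translation):
    """Split at '/' or '\\' marks; return (fragment, marker_after) pairs.

    Gap characters ('-') are removed from each fragment. The marker is
    None for the final fragment.
    """
    pieces = re.split(r"([/\\])", aligned_translation)
    fragments = pieces[0::2]          # text between the marks
    markers = pieces[1::2] + [None]   # the mark that follows each fragment
    return [(frag.replace("-", ""), mark)
            for frag, mark in zip(fragments, markers)]


toy = "MSSKHF/KILVNE"  # made-up string, not real program output
print(split_on_frameshifts(toy))
```

Each fragment's residue count times three gives the length of the
nucleotide stretch it covers, which is the starting point for mapping
fragments back to the original coordinates.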
>
> Let me look into this, thank you CF; I have not used that in the past.