[Bioperl-l] Polyproteins, ribo slippage, and mat_peptide in viruses?
Chris Fields
cjfields at illinois.edu
Tue Oct 27 16:46:05 EDT 2009
On Oct 27, 2009, at 3:17 PM, Peter wrote:
> On Tue, Oct 27, 2009 at 8:07 PM, Chris Larsen <clarsen at vecna.com> wrote:
>>
>> Peter,
>>
>> This is a good strategy when the gi is given. However, I failed to
>> mention that the example I gave turns out to be unusual (15%?): most
>> virus 'mature peptides' we will apply this analysis to do not in fact
>> have a gi number or unique identifier associated with them. There are
>> thousands of dengue virus files to be processed to give mature
>> proteins.
>>
>> I should have mentioned this. Hence the problem: we can't look it up,
>> because only the parent polyprotein has a gi. There's nothing to look
>> up /by/ in most cases. So we still have to build the set of proteins
>> that are cleaved out of every polyprotein, by local, high-throughput
>> methods, from the available information (sadly, kind of a runaround;
>> it should be in the GenBank entry).
>>
>> Chris
>
> Ah. That's a shame. I did just take a few minutes to try out the
> EFetch idea (using Biopython) and it does work beautifully for
> this "nice" example virus which the NCBI have annotated.

Interesting thing about that example: if you follow the hyperlinks for
the mat_peptide feature key, they relate back to the full protein
sequence with from/to, not to the protein_id for the feature. Example:

# link from the first mat_peptide
http://www.ncbi.nlm.nih.gov/protein/9630804?from=1&to=398&report=gpwithparts
# protein_id
http://www.ncbi.nlm.nih.gov/protein/28416959

This record doesn't appear to contain any mapping information along
those lines, which makes me think this is an autogenerated record
using the Gene record, which does have those mappings:

http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=Retrieve&dopt=full_report&list_uids=1491970
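
When only the parent polyprotein has an identifier, the same from/to
convention can be applied locally to the polyprotein sequence. A
minimal sketch in Python, assuming the coordinates are 1-based and
inclusive (as the from=1&to=398 link suggests). The function name,
the sequence and the peptide names below are made up for illustration
and do not come from any real record:

```python
def cut_mature_peptides(polyprotein, features):
    """Return {name: subsequence} for 1-based, inclusive (start, end) pairs."""
    peptides = {}
    for name, start, end in features:
        # Reject coordinates that fall outside the parent sequence.
        if not (1 <= start <= end <= len(polyprotein)):
            raise ValueError(
                f"{name}: {start}..{end} outside 1..{len(polyprotein)}"
            )
        # Convert 1-based inclusive coordinates to a Python slice.
        peptides[name] = polyprotein[start - 1:end]
    return peptides


# Hypothetical toy data, not taken from a GenBank entry.
polyprotein = "MKTAYIAKQRQISFVKSHFSRQ"
features = [("pepA", 1, 8), ("pepB", 9, 22)]
peptides = cut_mature_peptides(polyprotein, features)
```

The coordinates themselves would still have to come from the
mat_peptide features (or the Gene record), so this only covers the
cutting step.
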
> I also note that in the example given, all the mature peptides
> have nice and simple locations (in terms of their co-ordinates
> for the nucleotides), no ribosomal slippages etc. This means
> grabbing the relevant bits of the genome and translating it is
> also pretty easy (option 2 in your original email).
>
> Have you got a more typical entry you can point us at?
>
> If there is nothing publicly available, I wouldn't mind you
> emailing me one or two to look at off list (and if you don't mind,
> they might make good examples for Bio* project unit tests
> or examples).
>
> Peter
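
For the simple-location case Peter describes, grabbing the region and
translating it could look roughly like the sketch below. It assumes
the standard genetic code, a forward-strand feature, and 1-based
inclusive nucleotide coordinates; the function name and the toy
genome string are hypothetical. It does not handle joins, reverse
strand or slippage:

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate_region(genome, start, end):
    """Translate genome[start..end] (1-based, inclusive, forward strand)."""
    region = genome[start - 1:end].upper()
    if len(region) % 3:
        raise ValueError("region length is not a multiple of three")
    # Codons with ambiguous bases are reported as 'X'.
    return "".join(
        CODON_TABLE.get(region[i:i + 3], "X")
        for i in range(0, len(region), 3)
    )


# Hypothetical toy genome; the feature is taken to span positions 3..14.
genome = "CCATGAAAGGGTTTTAACC"
peptide = translate_region(genome, 3, 14)
```

In practice the Bio* toolkits already provide translation, so the
only real work here is pulling the coordinates out of each
mat_peptide feature.
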
I think one could use the full-length protein and run TFASTX (which
allows frameshifts) against the nucleotide sequence. The output will
have the frameshifts designated with '/' or '\', so it would then be a
matter of splitting the sequence based on the midline, then mapping
those protein fragments back to the original sequence coordinates. Is
this along the lines of what you mean?
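
As a rough sketch of the splitting and mapping step in Python: it
takes the translated-DNA line of an alignment as a plain string and
is not a TFASTX parser. The nucleotide offsets are an assumption
('/' taken as one nucleotide forward, '\' as one nucleotide back,
'-' as an alignment gap that consumes no nucleotides) and should be
checked against the FASTA documentation. The function name and the
input string are made up:

```python
def split_on_frameshifts(aligned, nt_start=1, shifts=None):
    """Split a translated alignment string at frameshift marks.

    Returns (fragment, first_nt, last_nt) tuples, 1-based inclusive,
    counted along the nucleotide sequence starting at nt_start.
    """
    if shifts is None:
        # ASSUMPTION: slash = +1 nucleotide, backslash = -1 nucleotide.
        shifts = {"/": 1, "\\": -1}
    fragments = []
    current = []
    pos = nt_start        # nucleotide where the next codon would start
    frag_start = pos
    for ch in aligned:
        if ch in shifts:
            # Close the fragment that ends at this frameshift.
            if current:
                fragments.append(("".join(current), frag_start, pos - 1))
                current = []
            pos += shifts[ch]
            frag_start = pos
        elif ch == "-":
            continue      # ASSUMPTION: gap character, no nucleotides used
        else:
            current.append(ch)
            pos += 3      # one residue corresponds to one codon
    if current:
        fragments.append(("".join(current), frag_start, pos - 1))
    return fragments


# Hypothetical aligned string with one forward and one reverse shift.
fragments = split_on_frameshifts("MKT/AYI\\AK")
```

Each tuple gives a protein fragment plus the nucleotide span it was
read from, which is what would be needed to write the pieces back as
features on the original sequence.
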
chris