[Bioperl-l] Polyproteins, ribo slippage, and mat_peptide in viruses?

Tue Oct 27 15:54:04 EDT 2009

On Tue, Oct 27, 2009 at 7:15 PM, Chris Larsen <clarsen at vecna.com> wrote:
>
> Hello Peter!
>
> For instance, check this:
> http://www.ncbi.nlm.nih.gov/nuccore/NC_001959
> ...
>
> No mat_peptide sequence is given. We want that...

Looking at the GenBank file displayed, the mat_peptide features
(mature peptides) do not include a translation entry (unlike the
parent CDS feature, which does). However, they do have protein IDs -
which are actually links in the HTML version.

This leads me to suggest a third option as an alternative to the two
ideas you outlined. You could parse the GenBank file(s), and for each
mat_peptide feature look up the protein ID via Entrez EFetch (e.g. as
a FASTA file, or a GenPept file). If you only have a relatively small
number of viruses and proteins this is probably going to be pretty
easy. At least, I could do it in Biopython and I am sure the same is
true with the BioPerl GenBank parser and their EFetch interface.
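
As a rough illustration of that third option, here is a minimal sketch
using only the Python standard library (not Biopython or BioPerl). The
function names are made up for this example, the feature-table scan is
deliberately naive (it assumes the usual GenBank flat-file indentation),
and the EFetch parameters shown are the ones I would expect to need for
protein FASTA, so please check them against the NCBI documentation.

```python
import re
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities EFetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Assumption: feature keys start after exactly five spaces, while
# qualifiers are indented further and start with a slash.
FEATURE_KEY = re.compile(r"^ {5}(\S+)\s+\S")
PROTEIN_ID = re.compile(r'^\s+/protein_id="([^"]+)"')


def mat_peptide_protein_ids(genbank_text):
    """Return the protein_id of every mat_peptide feature, in file order."""
    ids = []
    current_key = None
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            current_key = None
            continue
        # Any other unindented line (ORIGIN, CONTIG, //) ends the table.
        if in_features and line and not line.startswith(" "):
            in_features = False
        if not in_features:
            continue
        key_match = FEATURE_KEY.match(line)
        if key_match:
            current_key = key_match.group(1)
            continue
        id_match = PROTEIN_ID.match(line)
        if id_match and current_key == "mat_peptide":
            ids.append(id_match.group(1))
    return ids


def efetch_fasta_url(protein_ids):
    """Build an EFetch URL asking for the given proteins as FASTA text."""
    query = urlencode({
        "db": "protein",
        "id": ",".join(protein_ids),
        "rettype": "fasta",
        "retmode": "text",
    })
    return EFETCH_URL + "?" + query


def fetch_fasta(protein_ids):
    """Download the FASTA records (needs network access)."""
    with urlopen(efetch_fasta_url(protein_ids)) as handle:
        return handle.read().decode("utf-8")
```

You would read the GenBank file into a string, pass it to
mat_peptide_protein_ids(), and hand the resulting list to fetch_fasta().
For anything beyond a quick test, a real GenBank parser is the safer
choice, and NCBI asks that automated requests be kept to a modest rate.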

However, for a large dataset, handling it all locally (your options
(1) and (2)) sounds best.

Peter