[BioPython] How to check codon usage for specific amino acid positions in a given set of CDS sequences

Thu Jan 15 13:34:52 UTC 2009

On Thu, Jan 15, 2009 at 1:21 PM, Animesh Agrawal
<animesh.agrawal at anu.edu.au> wrote:
>
> Hi Marco,
> My apologies. I probably didn't make myself very clear in my last mail.
> I have a protein which is about 475 amino acids long and is highly
> conserved (over 95%) among different organisms. I have downloaded
> its CDS (coding sequence).
> I would like to calculate codon usage frequency for important amino acid
> positions, as you put it very nicely in your reply:
> "for a particular amino acid position (e.g. the first, or the third, or the
> last) the codon usage for those amino acids that are coded by more than
> one possible codon (e.g. Ala) the frequency with which every codon is used?"
> For example, in a set of four sequences:
>          1     2     3
>         Ala   Gly   Ile
> Seq1    GCT   GGT   ATT
> Seq2    GCC   GGC   ATC
> Seq3    GCA   GGA   ATA
> Seq4    GCG   GGG   ATT
>
> For the first amino acid position, i.e. Ala (which is coded by 4 codons),
> each codon is used once in the 4 sequences, which gives a frequency of
> 0.25 for each codon. For the third amino acid position, i.e. Ile (which
> is coded by 3 codons), ATT has a frequency of 0.5 while the other two
> each have a frequency of 0.25.

OK - first of all you will need to create an alignment of all the
different CDS sequences.  If they happen to be the same length this is
easy.  Otherwise, you'll want to align their PROTEIN sequences, and
then turn this into a nucleotide sequence alignment (where each gap
covers whole codons, i.e. three columns per gapped residue).  You may
be lucky and find the proteins all
align beautifully with no gaps.  Do you need advice on this step?
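
As a rough sketch of the "protein alignment to nucleotide alignment"
step, assuming you already have each gapped protein sequence as a plain
string and its ungapped CDS as another string: walk along the aligned
protein and emit three gap characters for every protein gap, otherwise
the next codon from the CDS. The function name thread_cds_onto_protein
is made up for illustration and is not part of Biopython, and the sketch
does not check that each codon actually translates to the aligned
residue.

```python
def thread_cds_onto_protein(aligned_protein, cds, gap="-"):
    """Return a codon-aligned nucleotide string.

    aligned_protein: gapped protein sequence taken from a protein alignment.
    cds: the ungapped coding sequence for that same protein.
    """
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * residues:
        raise ValueError("CDS is too short for the aligned protein")

    codons = []
    pos = 0  # current offset into the ungapped CDS
    for aa in aligned_protein:
        if aa == gap:
            # One gapped residue becomes one gapped codon.
            codons.append(gap * 3)
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


# Toy input: a protein "AGI" aligned with one gap after the first residue.
aligned = thread_cds_onto_protein("A-GI", "GCTGGTATT")
print(aligned)  # GCT---GGTATT
```

Any trailing stop codon in the CDS is simply ignored here, since the
loop only consumes one codon per aligned residue.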

Once you have the alignment file, it should be fairly trivial to count
the codons in each set of three columns.
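
Something along these lines should do the counting, as a minimal
sketch in plain Python. It assumes the aligned CDS sequences are
already in memory as equal-length strings whose length is a multiple of
three; the name codon_frequencies is just illustrative. The toy data is
the four-sequence example from the question, with glycine codons (GGT,
GGC, GGA, GGG) in the second column. Fully gapped codons ("---") are
left out of the totals, which is a choice you may want to change.

```python
from collections import Counter


def codon_frequencies(aligned_cds):
    """Return one {codon: frequency} dict per codon position."""
    length = len(aligned_cds[0])
    if length % 3 or any(len(seq) != length for seq in aligned_cds):
        raise ValueError("sequences must share a length divisible by 3")

    per_position = []
    for start in range(0, length, 3):
        # Count the codon each sequence uses in this set of three columns.
        counts = Counter(seq[start:start + 3].upper() for seq in aligned_cds)
        counts.pop("---", None)  # ignore fully gapped codons
        total = sum(counts.values())
        per_position.append(
            {codon: n / total for codon, n in counts.items()} if total else {}
        )
    return per_position


seqs = [
    "GCTGGTATT",  # Seq1
    "GCCGGCATC",  # Seq2
    "GCAGGAATA",  # Seq3
    "GCGGGGATT",  # Seq4
]
freqs = codon_frequencies(seqs)
print(freqs[0])  # each of the four Ala codons at 0.25
print(freqs[2])  # ATT at 0.5, ATC and ATA at 0.25
```

To look only at the "important" positions, index the returned list with
the (zero-based) amino acid positions you care about.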

Peter