[BioPython] How to check codon usage for specific amino acid positions in a given set of CDS sequences

Thu Jan 15 13:02:05 EST 2009

On Thu, Jan 15, 2009 at 2:26 PM, Giovanni Marco Dall'Olio
<dalloliogm at fastwebnet.it> wrote:
> Let's see how can you do this with biopython (ehi, Peter, please
> correct me if I say something wrong!! :)).
>
> ...
>
> You will be able to access all the sequences in your alignment by the
> _records property of AlignIO:

In Python anything starting with a single underscore is considered to
be a private variable, and you should avoid using it.  So you
shouldn't be doing alignment._records, and if you do, don't complain
if this implementation detail changes in a future version of
Biopython.

For the Alignment object, if you really want a list of SeqRecord
objects you should use alignment.get_all_seqs() instead.  Ugly I agree
- but in practice you don't need to do this so often.  You can use the
alignment object itself, e.g.

first_record = alignment[0]
last_record = alignment[-1]
for record in alignment :
   print record

I think there is still room for improvement to this bit of Biopython,
and there are a couple of open enhancement bugs.

If you are interested, here my quick solution for solving this code
using Bio.AlignIO is one way of solving Animesh's question.  First of
all, we need the alignment in a suitable file format (e.g. FASTA,
ClustalW, PHYLIP, Stockholm etc).  e.g. taking Animesh's example
alignment with four sequences of length nine:

handle = open("my_example.fasta","w")
handle.write(""">Alpha
GCT GCT ATT
>Beta
GCC GCC ATC
>Gamma
GCA GCA ATA
>Delta
GCG GCG ATT""")
handle.close()

Here is one solution using Bio.AlignIO to read in this as an alignment
object, and then count the codon usage at each position separately:

from Bio import AlignIO
#Change this next line if your real file is another file format:
alignment = AlignIO.read(open("my_example.fasta"),"fasta")

assert alignment.get_alignment_length() % 3 == 0, \
       "Alignment length is not a multiple of three!"

number_of_codons = int(alignment.get_alignment_length() / 3)
for codon_index in range(number_of_codons) :
    #Count the codons in a dictionary using upper case codons as keys
    counts = dict()
    for record in alignment :
        #In case the alignment is in mixed case, make everything upper case
        codon = str(record.seq[codon_index*3:codon_index*3+3]).upper()
        #Assuming you want to exclude gaps when calculating frequencies:
        if codon=="---" : continue
        #Increment the count by one, defaulting to zero
        #(there are lots of ways to write this code!)
        counts[codon] = counts.get(codon,0)+1

    #Turn the counts in frequencies - note that because I exclude gaps,
    #the total codon count can vary across the alignment.
    total = float(sum(counts.values()))
    freqs = dict((codon,count/total) for (codon,count) in counts.iteritems())
    print "Codon frequencies for columns %i to %i:" \
          % (codon_index*3+1,codon_index*3+3),
    print freqs

And the output should read:

Codon frequencies for columns 1 to 3: {'GCA': 0.25, 'GCC': 0.25,
'GCT': 0.25, 'GCG': 0.25}
Codon frequencies for columns 4 to 6: {'GCA': 0.25, 'GCC': 0.25,
'GCT': 0.25, 'GCG': 0.25}
Codon frequencies for columns 7 to 9: {'ATT': 0.5, 'ATC': 0.25, 'ATA': 0.25}

Peter