[Biopython] identify triplet sequences

Wed Jun 29 17:18:49 UTC 2011

Hi George,

This is my first post, so greetings to everyone else as well. For your
question, do you need to name all 8000 combinations? If not, then you can
use a dictionary to enumerate the occurence of each amino acid triplet. You
don't have to take into account all possibilities, just the one you find in
your sequences.

I've made a somewhat short & dirty script to do the analysis you want. It
also generates a fasta file containing random amino acid sequences of a
certain length as a source for a demo analysis. Here it is:

#!/usr/bin/env python

import random

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC

# function to generate random protein sequence
def random_prot(length):
    seq = ''
    for i in xrange(length):
        seq += random.choice(IUPAC.protein.letters)
    return seq

# function to generate list of SeqRecord objects
def seqrecord_gen(num, length):
    seqs = []
    for i in xrange(num):
        seqs.append(SeqRecord(Seq(random_prot(length)),
                    id='fasta'+str(i+1),
                    name='',
                    description=''))
    SeqIO.write(seqs, 'random_proteins.fa', 'fasta')

# function to read fasta file and count triplets
def count_triplet(source):
    triplets = {}
    seqs = SeqIO.parse(source, 'fasta')

    for rec in seqs:
        step = 0
        while step + 3 <= len(rec.seq.tostring()):
            tri = rec.seq.tostring()[0+step:3+step]
            if tri not in triplets:
                triplets[tri] = 1
            else:
                triplets[tri] += 1
            step += 1

    with open('results', 'w') as output:
        for key in sorted(triplets, key=triplets.get, reverse=True):
            output.writelines("{0}: {1}\n".format(key, triplets[key]))

# generate mock file
seqrecord_gen(100, 30)
# count the triplet
count_triplet('random_proteins.fa')

You can also see the script here: https://gist.github.com/1054348 (in case
there are formatting problems with the mail's display). Just remove the
seqrecord_gen() call and replace run count_triplet() with your fasta file
name as the argument. You can see the output in 'results'

Hope that helps!
Wibowo Arindrarto (Bow)

On Wed, Jun 29, 2011 at 18:15, George Devaniranjan
<devaniranjan at gmail.com>wrote:

> Hi,
>
> Not sure if this is a python or  bio-python question -but suggestions are
> most welcome.
>
> I have some FASTA sequences....like
> AAAAWWWHHHHH
> TTTYYYYYHGGGG
> NNNNNGGGGFFFF
>
> I extract from each sequence triplets moving from 1st residue and
> extracting
> the 2nd, 3rd as one triplet then 2/3/4 as another triplet then 3/4/5 as
> another triplet ...ect
> So for the 1st sequence given above.....
> AAA
> AAA
> AAW
> AWW
> .
> .
> .
> so on.....
>
> Now my question for 20amino acids there will be 8000 possible unique
> combinations (20^3)
>
> How can I classify them using python/biopython and write them out to 8000
> unique text files .....is there a way to classify them without writing 8000
> IF/ELSIF statements?
> I want to see which sets of triplets has the hightest occourence.
>
> Thank you.
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>