[Biopython] identify triplet sequences

Wed Jun 29 17:54:56 UTC 2011

Hi Wibowo,
Thank you for your answer,
If I want to classify only based on 1st and 3rd charecters of the triplets
while allowing the central charecter to be anything/vary can I do that?
so  any of the following should be under one list..instead of being counted
as unique.
ACA
ARA
AEA
AGA
...etc
You are right I don't need all 8000 combinations, just the one's that occur
in the list.
Thank you,
George

On Wed, Jun 29, 2011 at 1:18 PM, Wibowo Arindrarto
<w.arindrarto at gmail.com>wrote:

> Hi George,
>
> This is my first post, so greetings to everyone else as well. For your
> question, do you need to name all 8000 combinations? If not, then you can
> use a dictionary to enumerate the occurence of each amino acid triplet. You
> don't have to take into account all possibilities, just the one you find in
> your sequences.
>
> I've made a somewhat short & dirty script to do the analysis you want. It
> also generates a fasta file containing random amino acid sequences of a
> certain length as a source for a demo analysis. Here it is:
>
> #!/usr/bin/env python
>
> import random
>
> from Bio import SeqIO
> from Bio.Seq import Seq
> from Bio.SeqRecord import SeqRecord
> from Bio.Alphabet import IUPAC
>
> # function to generate random protein sequence
> def random_prot(length):
>     seq = ''
>     for i in xrange(length):
>         seq += random.choice(IUPAC.protein.letters)
>     return seq
>
> # function to generate list of SeqRecord objects
> def seqrecord_gen(num, length):
>     seqs = []
>     for i in xrange(num):
>         seqs.append(SeqRecord(Seq(random_prot(length)),
>                     id='fasta'+str(i+1),
>                     name='',
>                     description=''))
>     SeqIO.write(seqs, 'random_proteins.fa', 'fasta')
>
> # function to read fasta file and count triplets
> def count_triplet(source):
>     triplets = {}
>     seqs = SeqIO.parse(source, 'fasta')
>
>     for rec in seqs:
>         step = 0
>         while step + 3 <= len(rec.seq.tostring()):
>             tri = rec.seq.tostring()[0+step:3+step]
>             if tri not in triplets:
>                 triplets[tri] = 1
>             else:
>                 triplets[tri] += 1
>             step += 1
>
>     with open('results', 'w') as output:
>         for key in sorted(triplets, key=triplets.get, reverse=True):
>             output.writelines("{0}: {1}\n".format(key, triplets[key]))
>
> # generate mock file
> seqrecord_gen(100, 30)
> # count the triplet
> count_triplet('random_proteins.fa')
>
>
> You can also see the script here: https://gist.github.com/1054348 (in case
> there are formatting problems with the mail's display). Just remove the
> seqrecord_gen() call and replace run count_triplet() with your fasta file
> name as the argument. You can see the output in 'results'
>
> Hope that helps!
> Wibowo Arindrarto (Bow)
>
>
> On Wed, Jun 29, 2011 at 18:15, George Devaniranjan <devaniranjan at gmail.com
> > wrote:
>
>> Hi,
>>
>> Not sure if this is a python or  bio-python question -but suggestions are
>> most welcome.
>>
>> I have some FASTA sequences....like
>> AAAAWWWHHHHH
>> TTTYYYYYHGGGG
>> NNNNNGGGGFFFF
>>
>> I extract from each sequence triplets moving from 1st residue and
>> extracting
>> the 2nd, 3rd as one triplet then 2/3/4 as another triplet then 3/4/5 as
>> another triplet ...ect
>> So for the 1st sequence given above.....
>> AAA
>> AAA
>> AAW
>> AWW
>> .
>> .
>> .
>> so on.....
>>
>> Now my question for 20amino acids there will be 8000 possible unique
>> combinations (20^3)
>>
>> How can I classify them using python/biopython and write them out to 8000
>> unique text files .....is there a way to classify them without writing
>> 8000
>> IF/ELSIF statements?
>> I want to see which sets of triplets has the hightest occourence.
>>
>> Thank you.
>> _______________________________________________
>> Biopython mailing list  -  Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>
>