[Biopython] identify triplet sequences

Wibowo Arindrarto w.arindrarto at gmail.com
Wed Jun 29 18:12:28 UTC 2011


Hi George,

That can be done by modifying this line:

tri = rec.seq.tostring()[0+step:3+step]

into this:

tri = rec.seq.tostring()[0+step:3+step:2]

Alternatively, if you want a prettier output ('A*A' instead of 'AA' for all
triplets starting and ending with 'A') you can replace it with these
instead:

tri = rec.seq.tostring()[0+step]
tri += '*'
tri += rec.seq.tostring()[2+step]

Hope that helps!
Bow


On Wed, Jun 29, 2011 at 19:54, George Devaniranjan
<devaniranjan at gmail.com>wrote:

> Hi Wibowo,
> Thank you for your answer,
> If I want to classify only based on 1st and 3rd charecters of the triplets
> while allowing the central charecter to be anything/vary can I do that?
> so  any of the following should be under one list..instead of being counted
> as unique.
> ACA
> ARA
> AEA
> AGA
> ...etc
> You are right I don't need all 8000 combinations, just the one's that occur
> in the list.
> Thank you,
> George
>
>
> On Wed, Jun 29, 2011 at 1:18 PM, Wibowo Arindrarto <w.arindrarto at gmail.com
> > wrote:
>
>> Hi George,
>>
>> This is my first post, so greetings to everyone else as well. For your
>> question, do you need to name all 8000 combinations? If not, then you can
>> use a dictionary to enumerate the occurence of each amino acid triplet. You
>> don't have to take into account all possibilities, just the one you find in
>> your sequences.
>>
>> I've made a somewhat short & dirty script to do the analysis you want. It
>> also generates a fasta file containing random amino acid sequences of a
>> certain length as a source for a demo analysis. Here it is:
>>
>> #!/usr/bin/env python
>>
>> import random
>>
>> from Bio import SeqIO
>> from Bio.Seq import Seq
>> from Bio.SeqRecord import SeqRecord
>> from Bio.Alphabet import IUPAC
>>
>> # function to generate random protein sequence
>> def random_prot(length):
>>     seq = ''
>>     for i in xrange(length):
>>         seq += random.choice(IUPAC.protein.letters)
>>     return seq
>>
>> # function to generate list of SeqRecord objects
>> def seqrecord_gen(num, length):
>>     seqs = []
>>     for i in xrange(num):
>>         seqs.append(SeqRecord(Seq(random_prot(length)),
>>                     id='fasta'+str(i+1),
>>                     name='',
>>                     description=''))
>>     SeqIO.write(seqs, 'random_proteins.fa', 'fasta')
>>
>> # function to read fasta file and count triplets
>> def count_triplet(source):
>>     triplets = {}
>>     seqs = SeqIO.parse(source, 'fasta')
>>
>>     for rec in seqs:
>>         step = 0
>>         while step + 3 <= len(rec.seq.tostring()):
>>             tri = rec.seq.tostring()[0+step:3+step]
>>             if tri not in triplets:
>>                 triplets[tri] = 1
>>             else:
>>                 triplets[tri] += 1
>>             step += 1
>>
>>     with open('results', 'w') as output:
>>         for key in sorted(triplets, key=triplets.get, reverse=True):
>>             output.writelines("{0}: {1}\n".format(key, triplets[key]))
>>
>> # generate mock file
>> seqrecord_gen(100, 30)
>> # count the triplet
>> count_triplet('random_proteins.fa')
>>
>>
>> You can also see the script here: https://gist.github.com/1054348 (in
>> case there are formatting problems with the mail's display). Just remove the
>> seqrecord_gen() call and replace run count_triplet() with your fasta file
>> name as the argument. You can see the output in 'results'
>>
>> Hope that helps!
>> Wibowo Arindrarto (Bow)
>>
>>
>> On Wed, Jun 29, 2011 at 18:15, George Devaniranjan <
>> devaniranjan at gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Not sure if this is a python or  bio-python question -but suggestions are
>>> most welcome.
>>>
>>> I have some FASTA sequences....like
>>> AAAAWWWHHHHH
>>> TTTYYYYYHGGGG
>>> NNNNNGGGGFFFF
>>>
>>> I extract from each sequence triplets moving from 1st residue and
>>> extracting
>>> the 2nd, 3rd as one triplet then 2/3/4 as another triplet then 3/4/5 as
>>> another triplet ...ect
>>> So for the 1st sequence given above.....
>>> AAA
>>> AAA
>>> AAW
>>> AWW
>>> .
>>> .
>>> .
>>> so on.....
>>>
>>> Now my question for 20amino acids there will be 8000 possible unique
>>> combinations (20^3)
>>>
>>> How can I classify them using python/biopython and write them out to 8000
>>> unique text files .....is there a way to classify them without writing
>>> 8000
>>> IF/ELSIF statements?
>>> I want to see which sets of triplets has the hightest occourence.
>>>
>>> Thank you.
>>> _______________________________________________
>>> Biopython mailing list  -  Biopython at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biopython
>>>
>>
>>
>



More information about the Biopython mailing list