[BioPython] Help regarding multiple alignment

Sameet Mehta sameet at nccs.res.in
Sat Jul 24 13:10:31 EDT 2004


Hi,
I downloaded a set of sequences at different times by querying the 
NCBI database the old-fashioned way.  I stored the files individually 
and then copy-pasted the sequences so that they are all now in 
a single file.  It seems that during this exercise a lot of the sequences 
got repeated.  I want to curate these sequences so that I 
keep only a single occurrence of each one.  I wrote some code for this, which 
I am giving here:

<code>
from __future__ import division
import string
from Bio import Fasta

def curate(infile, outfile):
##    inf = raw_input('Path to Fasta file: ')
##    outfile = raw_input('Path to output file: ')

    outf = open(outfile, 'w')
    file = open(infile, 'r')
    
    parser = Fasta.RecordParser()
    iterator = Fasta.Iterator(file, parser)
    titleList = []
    t_d = {}
    s = ' '
    while 1:
        cur_record = iterator.next()
        if cur_record is None:
            break
        # Get the title set and validate to dict
        title_atoms = string.split(cur_record.title)
        title = title_atoms[0]
        titleList.append(title)
        
    titleList = find_same(titleList)
    # Re-open the input for a second pass (use the function argument,
    # not the global name `inf`)
    file = open(infile, 'r')
    parser = Fasta.RecordParser()
    iterator = Fasta.Iterator(file, parser)
    while 1:
        cur_record = iterator.next()
        if cur_record is None:
            break
        title_atoms = string.split(cur_record.title)
        title = title_atoms[0]
        if title in titleList:
            writestring = title+'\n'+cur_record.sequence+'\n\n'
            outf.writelines(writestring)
    outf.close()
    print 'file written to ',outfile,'\n'


def find_same(title_list):

    list = title_list
    curated = []
    repeated = []
    for a in list:
        if str(list).count(a) == 1:
            curated.append(a)
        else:
            repeated.append(a)

    return curated

print """Description of the program"""

inf = raw_input('Path to Fasta file: ')
outfile = raw_input('Path to output file: ')
curate(inf, outfile)
</code>
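
For comparison, here is a minimal sketch of keeping the first occurrence of each title, written in plain Python 3 rather than with the Bio.Fasta calls above. As posted, find_same keeps only the titles that occur exactly once, so a sequence that was pasted twice is dropped entirely instead of being kept once. In addition, str(list).count(a) counts substring matches in the printed list, so an ID that is a prefix of another ID (for example gi|1 and gi|12) is also treated as repeated. The helper name unique_in_order is hypothetical, and the sketch assumes that two records sharing the same first title word are the same sequence.

```python
def unique_in_order(titles):
    """Return each title once, in order of first appearance."""
    seen = set()      # titles already encountered
    unique = []       # result, preserving input order
    for title in titles:
        if title not in seen:
            seen.add(title)
            unique.append(title)
    return unique


# Hypothetical IDs: "gi|1" is repeated, and is also a prefix of "gi|12".
titles = ["gi|1", "gi|2", "gi|1", "gi|12"]
print(unique_in_order(titles))
```

A set lookup is an exact match on the whole title, so the prefix problem does not arise, and each title is checked in roughly constant time.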

Now, I understand that there is probably a lot of redundant code.  However, the 
output file I get doesn't seem to be an 'authentic' FASTA file, because no 
multiple alignment program recognizes it.  I could have done something 
really wrong, but I am not able to figure out what!  Any help is highly 
appreciated.
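
One likely cause, offered as a guess rather than a confirmed diagnosis: each record is written as the bare title followed by the sequence, and a FASTA header line must start with '>'. If cur_record.title does not include that marker, the output has no header lines that an alignment program can recognize. The sketch below uses plain Python 3 with hypothetical helper names (read_fasta, write_unique_fasta) and no Biopython dependency. It reads records, keeps the first copy of each ID, and writes them back with the '>' marker. It assumes the first whitespace-separated word of the header identifies the sequence.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_unique_fasta(in_handle, out_handle, width=60):
    """Write the first record seen for each ID, with a '>' header line."""
    seen = set()
    for header, seq in read_fasta(in_handle):
        parts = header.split()
        seq_id = parts[0] if parts else ""
        if seq_id in seen:
            continue  # a copy of this ID was already written
        seen.add(seq_id)
        out_handle.write(">" + header + "\n")
        # Wrap the sequence at a fixed width, as most FASTA files do.
        for i in range(0, len(seq), width):
            out_handle.write(seq[i:i + width] + "\n")


# Small in-memory demonstration; real use would pass open file objects.
source = io.StringIO(
    ">gi|1 first\nACGT\nACGT\n>gi|2 second\nTTTT\n>gi|1 first\nACGT\nACGT\n"
)
result = io.StringIO()
write_unique_fasta(source, result)
print(result.getvalue())
```

With real files, the two StringIO objects would be replaced by open(infile) and open(outfile, 'w').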

Regards,
Thanking you in anticipation,

Sameet


--
National Centre for Cell Science, Pune


