[BioPython] Help regarding multiple alignment
Sameet Mehta
sameet at nccs.res.in
Sat Jul 24 13:10:31 EDT 2004
Hi,
I downloaded a set of sequences at different times by querying the NCBI
database the old-fashioned way. I stored the files individually and then
copy-pasted the sequences so that they are all now in a single file. It
seems that during this exercise many of the sequences ended up repeated.
I want to curate these sequences so that I keep only a single occurrence
of each one. I wrote some code for this, which I am giving here:
<code>
from __future__ import division
import string
from Bio import Fasta

def curate(infile, outfile):
    ## inf = raw_input('Path to Fasta file: ')
    ## outfile = raw_input('Path to output file: ')
    outf = open(outfile, 'w')
    file = open(infile, 'r')
    parser = Fasta.RecordParser()
    iterator = Fasta.Iterator(file, parser)
    titleList = []
    t_d = {}
    s = ' '
    # First pass: collect the first word of every title
    while 1:
        cur_record = iterator.next()
        if cur_record is None:
            break
        # Get the title set and validate to dict
        title_atoms = string.split(cur_record.title)
        title = title_atoms[0]
        titleList.append(title)
    titleList = find_same(titleList)
    # Second pass: re-read the input and write the selected records
    file = open(infile, 'r')
    parser = Fasta.RecordParser()
    iterator = Fasta.Iterator(file, parser)
    while 1:
        cur_record = iterator.next()
        if cur_record is None:
            break
        title_atoms = string.split(cur_record.title)
        title = title_atoms[0]
        if title in titleList:
            writestring = title+'\n'+cur_record.sequence+'\n\n'
            outf.writelines(writestring)
    outf.close()
    print 'file written to ',outfile,'\n'

def find_same(title_list):
    list = title_list
    curated = []
    repeated = []
    for a in list:
        if str(list).count(a) == 1:
            curated.append(a)
        else:
            repeated.append(a)
    return curated

print """Description of the program"""
inf = raw_input('Path to Fasta file: ')
outfile = raw_input('Path to output file: ')
curate(inf, outfile)
</code>
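Note that find_same keeps only the titles that occur exactly once, so any
entry that was pasted twice is dropped entirely rather than kept once.
Because it counts substrings in str(list), a title that is a prefix of
another title is also treated as repeated. A minimal sketch of one
alternative, in current Python and without Biopython, is given below. It
assumes the records are already available as (title, sequence) tuples and
that "the same sequence" means identical residues ignoring case; the name
unique_records is hypothetical.

```python
def unique_records(records):
    """Yield (title, sequence) pairs, keeping the first copy of each sequence.

    Assumption: two records are duplicates when their sequences are
    identical after upper-casing; titles are not compared.
    """
    seen = set()
    for title, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            yield title, sequence


if __name__ == "__main__":
    demo = [("seq1", "ACGT"), ("seq1_copy", "acgt"), ("seq2", "GGCC")]
    for title, sequence in unique_records(demo):
        print(title, sequence)
```

To treat records as duplicates by identifier instead, the key could be the
first word of the title rather than the sequence.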
Now, I understand that there is probably a lot of redundant code. However,
the output file I get doesn't seem to be a valid FASTA file, because no
multiple alignment program recognizes it. I may have done something
really wrong, but I am not able to figure out what. Any help is greatly
appreciated.
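A likely cause is that the script writes title + newline + sequence, so no
line in the output begins with '>', which FASTA readers use to recognize
the start of a record. This assumes cur_record.title holds the header text
without the leading '>'. A minimal sketch of formatting one record, in
current Python, is given below; the name format_fasta and the 60-column
line width are illustrative choices, not part of Biopython.

```python
def format_fasta(title, sequence, width=60):
    """Return one FASTA record: a '>' header line, then wrapped sequence lines."""
    # Split the sequence into fixed-width chunks so that no line is too long.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + title + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_fasta("seq1 example record", "ACGTACGTAC", width=4), end="")
```

Writing the return value of format_fasta for each record, with no blank
line between records, gives a file in the usual multi-FASTA layout.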
Regards,
Thanking you in anticipation,
Sameet
--
National Centre for Cell Science, Pune