[BioPython] Bug in GenBank module - record.feature method ?
Srinivas Iyyer
srini_iyyer_bio at yahoo.com
Sat Dec 24 12:49:40 EST 2005
Hi group,
I have been working to parse out the GO annotations
from FEATURE section of GenBank record.
The GO annotations are incorporated into the CDS
section of FEATURES part.
Here is my script:
from Bio import GenBank
from sets import Set
import glob
def FeatureGOparser(myseq):
parser = GenBank.RecordParser()
record = parser.parse(open(myseq,'r'))
## this gives the CDS in features CDS is second in
features section ###
feature = record.features[2]
golist = feature.qualifiers[1].value """extracts
'/Note' part
go_list = golist.split(';') #split by ';' to get
GO secs."
refseq = record.version # getting NM_
gene_name =
(feature.qualifiers[0].value).split('"')[1]
go_comp = []
go_func = []
go_proc = []
# from here trimming of GO list obtained above#
for i in go_list:
line = i.strip()
if line.startswith('go_component'):
iline = line.split('[')[0]
com_line =
gene_name+'\t'+refseq+'\t'+iline
go_comp.append(com_line)
for m in go_list:
line = m.strip()
if line.startswith('go_function'):
mline = line.split('[')[0]
func_line =
gene_name+'\t'+refseq+'\t'+mline
go_func.append(func_line)
for j in go_list:
line = i.strip()
if line.startswith('go_process'):
jline = line.split('[')[0]
proc_line =
gene_name+'\t'+refseq+'\t'+jline
go_proc.append(proc_line)
unique_go_comp = Set(go_comp)
unique_go_func = Set(go_func)
unique_go_proc = Set(go_proc)
for x in unique_go_comp:
print x
for y in unique_go_func:
print y
for z in unique_go_proc:
print z
files = glob.glob('/home/seq/genbank/refseq/*')
def main():
for each in files:
FeatureGOparser(each)
main()
Error:
RGS3 NM_144489.2 go_component: membrane
RGS3 NM_144489.2 go_component: cytosol
RGS3 NM_144489.2 go_component: nucleus
RGS3 NM_144489.2 go_function: protein binding
RGS3 NM_144489.2 go_function: GTPase activator
activity
RGS3 NM_144489.2 go_function: signal transducer
activity
RGS3 NM_144489.2 go_process: regulation of
G-protein coupled receptor protein signaling pathway
FREQ NM_014286.2 go_component: Golgi stack
FREQ NM_014286.2 go_function: calcium ion
binding
DKFZp686O24166 NM_001009913.1 go_function:
structural molecule activity
Traceback (most recent call last):
File "genbank_go_parser_ver2.py", line 58, in ?
main()
File "genbank_go_parser_ver2.py", line 57, in main
FeatureGOparser(each)
File "genbank_go_parser_ver2.py", line 11, in
FeatureGOparser
feature = record.features[2]
IndexError: list index out of range
# feature = record.features[2] ###
this gives the CDS part. record.features[2]. I see
that this order is not always true. For many sequences
in FEATURES section, 'gene' is always followed by
'CDS'. However in some new RefSeq sequences,
'variation' sub-section is incorporated now. this is
the trouble, I guess.
So there are two things, that I need some
help/suggestions/comments.
1. Is there any more technical way to parse '/note'
sub-section in CDS section of FEATURES. Do you think
what I am doing (record.features[2]) is more novice
and not technical/correct. Please let me know what is
the best process.
2. If there any other way to parse GO annotations for
all RefSeq sequences in GenBank format.
thanks
srini
__________________________________
Yahoo! for Good - Make a difference this year.
http://brand.yahoo.com/cybergivingweek2005/
More information about the BioPython
mailing list