[BioPython] Bug in GenBank module - record.feature method ?

Sat Dec 24 12:49:40 EST 2005

Hi group, 
 I have been working to parse out the GO annotations
from FEATURE section of GenBank record. 

The GO annotations are incorporated into the CDS
section of FEATURES part. 

Here is my script:

from Bio import GenBank
from sets import Set
import glob

def FeatureGOparser(myseq):
    parser = GenBank.RecordParser()
    record = parser.parse(open(myseq,'r'))

    ## this gives the CDS in features CDS is second in
features section ###
    feature = record.features[2]

    golist = feature.qualifiers[1].value """extracts
'/Note' part
    go_list =  golist.split(';') #split by ';' to get
GO secs." 
    refseq = record.version   # getting NM_
    gene_name =
(feature.qualifiers[0].value).split('"')[1]
    go_comp = []
    go_func = []
    go_proc = []
   # from here trimming of GO list obtained above#
    for i in go_list:
        line = i.strip()
        if line.startswith('go_component'):
            iline = line.split('[')[0]
            com_line =
gene_name+'\t'+refseq+'\t'+iline
            go_comp.append(com_line)

    for m in go_list:
        line = m.strip()
        if line.startswith('go_function'):
            mline = line.split('[')[0]
            func_line =
gene_name+'\t'+refseq+'\t'+mline
            go_func.append(func_line)

    for j in go_list:
        line = i.strip()
        if line.startswith('go_process'):
            jline = line.split('[')[0]
            proc_line =
gene_name+'\t'+refseq+'\t'+jline
            go_proc.append(proc_line)

    unique_go_comp = Set(go_comp)
    unique_go_func = Set(go_func)
    unique_go_proc = Set(go_proc)
    for x in unique_go_comp:
        print x

    for y in unique_go_func:
        print y

    for z in unique_go_proc:
        print z

files = glob.glob('/home/seq/genbank/refseq/*')

def main():
    for each in files:
        FeatureGOparser(each)
main()

Error:

RGS3    NM_144489.2     go_component: membrane
RGS3    NM_144489.2     go_component: cytosol
RGS3    NM_144489.2     go_component: nucleus
RGS3    NM_144489.2     go_function: protein binding
RGS3    NM_144489.2     go_function: GTPase activator
activity
RGS3    NM_144489.2     go_function: signal transducer
activity
RGS3    NM_144489.2     go_process: regulation of
G-protein coupled receptor protein signaling pathway
FREQ    NM_014286.2     go_component: Golgi stack
FREQ    NM_014286.2     go_function: calcium ion
binding
DKFZp686O24166  NM_001009913.1  go_function:
structural molecule activity
Traceback (most recent call last):
  File "genbank_go_parser_ver2.py", line 58, in ?
    main()
  File "genbank_go_parser_ver2.py", line 57, in main
    FeatureGOparser(each)
  File "genbank_go_parser_ver2.py", line 11, in
FeatureGOparser
    feature = record.features[2]
IndexError: list index out of range

# feature = record.features[2] ###
this gives the CDS part.  record.features[2]. I see
that this order is not always true. For many sequences
in FEATURES section,  'gene' is always followed by
'CDS'. However in some new RefSeq sequences,
'variation' sub-section is incorporated now.  this is
the trouble, I guess. 

So there are two things, that I need some
help/suggestions/comments.

1. Is there any more technical way to parse '/note'
sub-section in CDS section of FEATURES.  Do you think
what I am doing (record.features[2]) is more novice
and not technical/correct. Please let me know what is
the best process.

2. If there any other way to parse GO annotations for
all RefSeq sequences in GenBank format. 

thanks
srini

__________________________________ 
Yahoo! for Good - Make a difference this year. 
http://brand.yahoo.com/cybergivingweek2005/