[Biopython] Extracting data genpept files

Tue Nov 23 03:52:15 UTC 2010

Hello all,

    I think Peter pointed me to part of this code (shown below) for
extracting data out of a GenPept file. I am trying to get a handle on
the formatting end of things. My question is this: when the taxonomic
data grabbed by tax_records = gb_record.annotations["taxonomy"] is
missing, instead of leaving the space blank the program fills it in
with the next piece of data, usually the date. This throws off the
whole spreadsheet when I import the output as a CSV file.

   Is there a way to have the program write in white space when it  
encounters missing data instead of the date?

Thanks,
Ara

PS: As soon as the formatting is sorted out and folks created for input
and such, I will post the code up here.

from Bio import SeqIO

gb_file = "sequence.gp.txt"


def index_genbank_features(gb_record, feature_type, qualifier):
    # Map each value of the given qualifier to the index of the
    # feature (of the given type) that carries it.
    answer = dict()
    for (index, feature) in enumerate(gb_record.features):
        if feature.type == feature_type:
            if qualifier in feature.qualifiers:
                for value in feature.qualifiers[qualifier]:
                    if value in answer:
                        print("WARNING - Duplicate key %s for %s features %i and %i"
                              % (value, feature_type, answer[value], index))
                    else:
                        answer[value] = index
    return answer


gg = open("raw_genbank.txt", "w")

for gb_record in SeqIO.parse(open(gb_file, "r"), "genbank"):
    gb_feature = gb_record.features[2]  # not used below

    locus_tag_cds_index = index_genbank_features(gb_record, "CDS", "locus_tag")
    coded_by_cds_index = index_genbank_features(gb_record, "CDS", "coded_by")
    name_by_source_index = index_genbank_features(gb_record, "source", "organism")
    protein_id_cds_index = index_genbank_features(gb_record, "CDS", "protein_id")

    gb_annotations = gb_record.annotations  # not used below
    tax_records = gb_record.annotations["taxonomy"]
    accession = gb_record.annotations["accessions"]
    date = gb_record.annotations["date"]
    function = gb_record.description

    # Each record is written as the str() of one Python list, one per line.
    gg.write(str([accession, locus_tag_cds_index, coded_by_cds_index,
                  name_by_source_index, tax_records, date, function]))
    gg.write("\n")
gg.close()
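
For what it is worth, here is a rough sketch of the kind of output being asked for. It is an assumption, not a tested fix for the script above. It uses Python's standard csv module instead of writing the str() of a list, and it joins each list into a single cell so that a missing or empty taxonomy gives a blank cell and the date stays in its own column. A plain dict stands in for gb_record.annotations, and the helper name annotations_to_row and the two sample records are made up for illustration.

```python
import csv
import io


def annotations_to_row(annotations, description):
    """Build one flat CSV row; missing fields become empty strings."""
    # A plain dict stands in for gb_record.annotations (an assumption).
    accessions = annotations.get("accessions", [])
    taxonomy = annotations.get("taxonomy", [])
    return [
        ";".join(accessions),
        ";".join(taxonomy),  # empty or missing list gives "" (a blank cell)
        annotations.get("date", ""),
        description,
    ]


# Hypothetical records: the second one has no taxonomy at all.
records = [
    ({"accessions": ["ABC12345"],
      "taxonomy": ["Bacteria", "Firmicutes"],
      "date": "01-JAN-2010"}, "example protein 1"),
    ({"accessions": ["XYZ67890"],
      "date": "02-FEB-2010"}, "example protein 2"),
]

out = io.StringIO()  # stands in for the output file
writer = csv.writer(out)
writer.writerow(["accession", "taxonomy", "date", "description"])
for annotations, description in records:
    writer.writerow(annotations_to_row(annotations, description))

print(out.getvalue())
```

Every row then has the same number of cells, so a spreadsheet import should keep the columns lined up whether or not the taxonomy is present.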