[BioPython] GenBank parsing errors

Wed Nov 24 14:35:20 EST 2004

Peter wrote:
>> For example, this small sample of code fails using E. coli K12,
>> file NC_000913.gbk (about 10MB) available from here:

(code removed)

>> I see CPU usage at almost 100%, and memory usage for Python 
>> goes steadily up.  At about 200 or 300MB the CPU usage drops, 
>> and my system becomes very sluggish.  I normally kill the 
>> process at this point.

Admin wrote:
> I have tried to use the Genbank or bacterial genomes in the past 
> but I had to abandon it because it thrashes around in memory as 
> you have described, as the sequence is just too large for the 
> API.  I was splicing out cds features from the records.
> 
> I had to write a custom parser to get the job done

Good to know it's not just me or my computer :)

I have also resorted to writing my own custom script to do the job.

For each gene I wanted the translated sequence and the CDS 
information (i.e. position in genome), which are both fairly easy to 
get from the GenBank file.

I wrote some code to convert the GenBank file into a custom .faa 
FASTA file with the CDS location (and a few other properties like 
the product) encoded into the FASTA title records.
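
As a rough illustration of that kind of conversion (not the actual 
script), here is a minimal sketch in modern Python.  It reads the 
feature table line by line, so the genome sequence is never held in 
memory.  The function name cds_to_fasta and the 
"locus_tag|location|product" title layout are assumptions made for 
this example, and it assumes the standard GenBank flat-file layout 
with feature keys in the first 21 columns.

```python
def cds_to_fasta(handle, out):
    """Stream a GenBank flat file; write one FASTA record per CDS.

    Hypothetical helper: titles look like locus_tag|location|product.
    Returns the number of records written.
    """
    in_features = False
    key = None        # current feature key, e.g. "CDS"
    location = ""     # location string of the current feature
    quals = {}        # qualifier name -> value for the current feature
    last = None       # qualifier currently being continued, if any
    count = 0

    def flush():
        # Write the feature collected so far if it is a translated CDS.
        nonlocal count
        if key == "CDS" and "translation" in quals:
            out.write(">%s|%s|%s\n" % (quals.get("locus_tag", "unknown"),
                                       location,
                                       quals.get("product", "")))
            seq = quals["translation"]
            for i in range(0, len(seq), 60):
                out.write(seq[i:i + 60] + "\n")
            count += 1

    for line in handle:
        line = line.rstrip()
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue  # header or sequence lines: skip without storing
        if line.startswith("ORIGIN") or line.startswith("//"):
            flush()
            key, in_features = None, False
            continue
        head, body = line[:21].strip(), line[21:].strip()
        if head:                        # a new feature starts here
            flush()
            key, location, quals, last = head, body, {}, None
        elif body.startswith("/"):      # a new qualifier
            name, _, value = body[1:].partition("=")
            quals[name] = value.strip('"')
            last = name
        elif last is None:              # location wrapped onto this line
            location += body
        else:                           # qualifier value wrapped
            sep = "" if last == "translation" else " "
            quals[last] += sep + body.strip('"')
    flush()  # in case the file ends without ORIGIN or //
    return count
```

It would be called with an open GenBank file and an open output 
file.  One known weakness of this sketch: a wrapped free-text 
qualifier line that happens to begin with "/" would be mistaken for 
a new qualifier.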

This lets me use all the BioPython FASTA support, and get the 
additional information by parsing the sequence title/description.
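
Getting the information back out of a title could look something 
like the sketch below.  It assumes a hypothetical title layout of 
"locus_tag|location|product", where location is the GenBank 
location string; the function name parse_title and the returned 
keys are likewise made up for this example.

```python
import re


def parse_title(title):
    """Split a FASTA title of the assumed form locus_tag|location|product."""
    # maxsplit=2 keeps any "|" inside the product text intact.
    locus_tag, location, product = title.lstrip(">").strip().split("|", 2)
    # Every integer in the location string; for join(...) locations the
    # smallest and largest values give the overall span of the CDS.
    coords = [int(n) for n in re.findall(r"\d+", location)]
    return {
        "locus_tag": locus_tag,
        "location": location,
        "start": min(coords),
        "end": max(coords),
        "strand": -1 if location.startswith("complement") else 1,
        "product": product,
    }
```

Locations that refer to another accession would confuse the simple 
digit matching here, so this only suits single-record genome files.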

Peter