[BioPython] GenBank parsing errors
Peter
biopython at maubp.freeserve.co.uk
Wed Nov 24 14:35:20 EST 2004
Peter wrote:
>> For example, this small sample of code fails using E. coli K12,
>> file NC_000913.gbk (about 10MB) available from here:
(code removed)
>> I see CPU usage at almost 100%, and memory usage for Python
>> goes steadily up. At about 200 or 300MB the CPU usage drops,
>> and my system becomes very sluggish. I normally kill the
>> process at this point.
Admin wrote:
> I have tried to use the Genbank or bacterial genomes in the past
> but I had to abandon it because it thrashes around in memory as
> you have described, as the sequence is just too large for the
> API. I was splicing out cds features from the records.
>
> I had to write a custom parser to get the job done
Good to know it's not just me or my computer :)
I have also resorted to writing my own custom script to do the job.
For each gene I wanted the translated sequence and the CDS
information (i.e. position in genome), which are both fairly easy to
get from the GenBank file.
I wrote some code to convert the GenBank file into a custom .faa
FASTA file with the CDS location (and a few other properties like
the product) encoded into the FASTA title records.
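For illustration only (this is not the script described above), a
conversion of this kind might look roughly like the sketch below. It
reads the GenBank file line by line, so memory use stays small. The
function names, the "name|location|product" title layout and the tiny
SAMPLE record are all made up for the example. It also assumes the
usual flat-file layout (feature keys indented 5 spaces, qualifiers 21)
and keeps only the last value of any repeated qualifier.

```python
import io

def iter_cds(handle):
    """Yield one dict per CDS feature, reading the file line by line."""
    feature = None    # feature currently being read
    qual = None       # name of the qualifier most recently started
    in_table = False
    for raw in handle:
        line = raw.rstrip()
        if not in_table:
            in_table = line.startswith("FEATURES")
            continue
        if not line:
            continue
        indent = len(line) - len(line.lstrip())
        if indent < 21:
            # A new feature key (indent 5) or a top-level keyword such as
            # ORIGIN (indent 0) closes the feature read so far.
            if feature is not None and feature["key"] == "CDS":
                yield feature
            feature, qual = None, None
            if indent == 0:
                in_table = False
            else:
                key, _, location = line.strip().partition(" ")
                feature = {"key": key, "location": location.strip(),
                           "qualifiers": {}}
            continue
        if feature is None:
            continue
        text = line.strip()
        quals = feature["qualifiers"]
        # An odd number of double quotes means the value is still open.
        value_open = qual is not None and quals[qual].count('"') % 2 == 1
        if text.startswith("/") and not value_open:
            qual, _, value = text[1:].partition("=")
            quals[qual] = value
        elif qual is None:
            feature["location"] += text      # location wrapped onto a new line
        else:
            sep = "" if qual == "translation" else " "
            quals[qual] += sep + text
    if feature is not None and feature["key"] == "CDS":
        yield feature

def write_faa(gbk_handle, faa_handle, width=60):
    """Write each translated CDS as FASTA, with its location in the title."""
    for cds in iter_cds(gbk_handle):
        quals = cds["qualifiers"]
        protein = quals.get("translation", "").strip('"')
        if not protein:
            continue                         # no /translation, nothing to write
        name = quals.get("locus_tag", "unknown").strip('"')
        product = quals.get("product", "").strip('"')
        faa_handle.write(">%s|%s|%s\n" % (name, cds["location"], product))
        for i in range(0, len(protein), width):
            faa_handle.write(protein[i:i + width] + "\n")

# Made-up record, just large enough to exercise wrapped lines.
SAMPLE = """\
LOCUS       DEMO                      60 bp    DNA     linear   BCT 01-JAN-2000
FEATURES             Location/Qualifiers
     source          1..60
                     /organism="Example organism"
     gene            1..30
                     /locus_tag="x0001"
     CDS             complement(join(1..12,
                     19..30))
                     /locus_tag="x0001"
                     /product="hypothetical protein with a long
                     name"
                     /translation="MKRISTTITTTITITTGNGAG
                     MKVLA"
ORIGIN
        1 acgtacgtac
//
"""

out = io.StringIO()
write_faa(io.StringIO(SAMPLE), out)
print(out.getvalue(), end="")
```

For a real genome the two StringIO objects would be replaced by
ordinary file handles opened on the .gbk input and the .faa output.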
This lets me use all the BioPython FASTA support, and get the
additional information by parsing the sequence title/description.
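As a rough sketch of that last step, the title string returned by
whatever FASTA reader is in use could be unpacked as shown below. The
"name|location|product" layout and the parse_title helper are
assumptions made for this example, not the format actually used. It
also assumes the location contains no references to other accessions,
since their digits would be mistaken for coordinates.

```python
import re

def parse_title(title):
    """Split a hypothetical 'name|location|product' FASTA title into parts."""
    name, location, product = title.lstrip(">").split("|", 2)
    # Collect every coordinate in the location, so join(...) works too.
    coords = [int(n) for n in re.findall(r"\d+", location)]
    return {
        "name": name,
        "location": location,
        "start": min(coords),
        "end": max(coords),
        "strand": -1 if location.startswith("complement") else 1,
        "product": product,
    }

info = parse_title(">x0001|complement(join(1..12,19..30))|hypothetical protein")
print(info["start"], info["end"], info["strand"])
```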
Peter