[BioPython] Large GenBank files: impossible to handle?

Tue Jan 3 13:57:01 EST 2006

Hans Meier wrote:
>  Dear friends,
>  
>  I tried to handle a .gbk file of 4.7 MB in size
>  on a box with a 700 MHz Pentium III and 256 MB RAM.

This might work with the current release (BioPython 1.41), but it will 
use a lot of memory. I would guess about 250MB, which is all your 
machine has.  This is a limitation of the old Martel-based parser.

You should be able to install the new GenBank parser (from CVS), which I 
wrote specifically to address problems with large GenBank files.

Ask if you need help with this - and are you on Windows or Linux?

See also bug 1747, http://bugzilla.open-bio.org/show_bug.cgi?id=1747

>  Parsing with "RecordParser" and indexing with "index_file"
>  crashed the machine in both cases, and I had to reboot
>  (which does not happen often with Debian).

I would avoid using index_file on large GenBank files - this still uses 
Martel and can be rather slow.  Also, I strongly suspect your files have 
a single record each (i.e. only one LOCUS line) in which case there is 
no need to index them.
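
A quick way to check this without loading the file into memory is to 
count the LOCUS lines. The following is only a rough sketch using plain 
Python (no BioPython needed); the function name is made up for 
illustration, and it assumes each record starts with a line beginning 
with "LOCUS" in the first column:

```python
def count_genbank_records(path):
    """Count records in a GenBank file by counting LOCUS header lines.

    The file is read one line at a time, so memory use stays small
    even for large files.
    """
    count = 0
    with open(path) as handle:
        for line in handle:
            # Assumption: only record headers start with "LOCUS" in column 0.
            if line.startswith("LOCUS"):
                count += 1
    return count
```

If this returns 1 for your file, there is only one record and an index 
would gain you nothing.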

>  My final goal is to access the .gbk file somehow like a database.

Have you tried using the FeatureParser and then accessing the .features 
list property of the record?
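
Something along these lines is what I have in mind. Treat it as a 
sketch rather than tested code for your setup: the helper names are 
made up, and it assumes your BioPython provides GenBank.FeatureParser 
with a parse(handle) method and that each feature has a .type string 
such as "CDS" or "gene":

```python
def load_genbank_record(path):
    """Parse a single-record GenBank file into a record with features."""
    # Imported here so the filtering helper below works without BioPython.
    from Bio import GenBank

    parser = GenBank.FeatureParser()
    with open(path) as handle:
        # Assumption: the file holds exactly one record.
        return parser.parse(handle)


def features_of_type(record, feature_type):
    """Return the features of the record whose type matches feature_type."""
    return [f for f in record.features if f.type == feature_type]
```

For example, features_of_type(load_genbank_record("example.gbk"), "CDS") 
would give you the list of CDS features, where "example.gbk" stands in 
for your own file.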

>  The alternative would be to use .fna, .faa and .ffn files 
>  and write my own methods. Or stuff all the data in a SQL-database.
>  But I still hope that Biopython could help.
>  
>  Before I spend more time on this, I'd like to ask you:
>  
>  With the Biopython tools, is it possible to handle
>  .gbk files of about 5MB in a reasonable time with 
>  a low- to middle-class desktop computer? If so, how?

Using the latest BioPython code it should be easy (see above).

Also, these two recent examples might be handy:

http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/genbank/

http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/genbank2fasta/

Peter