[BioPython] Large GenBank files: impossible to handle?
    Peter 
    biopython at maubp.freeserve.co.uk
       
    Tue Jan  3 13:57:01 EST 2006
    
    
  
Hans Meier wrote:
>  Dear friends,
>  
>  I tried to handle a .gbk file of 4.7 MB in size
>  with a "700MHz, Pentium III, 256 MB RAM"-box.
This might work with the current release (BioPython 1.41) but will use a 
lot of memory - I would guess about 250MB, which is all your machine 
has.  This is a limitation of the old Martel-based parser.
You should be able to install the new GenBank parser (from CVS) which I 
wrote specifically due to problems with large GenBank files.
Ask if you need help with this.  Also, are you on Windows or Linux?
See also bug 1747, http://bugzilla.open-bio.org/show_bug.cgi?id=1747
>  Parsing with "RecordParser" and indexing with "index_file"
>  crashed the machine in both cases, and I had to reboot
>  (which does not happen often with Debian).
I would avoid using index_file on large GenBank files - this still uses 
Martel and can be rather slow.  Also, I strongly suspect your files have 
a single record each (i.e. only one LOCUS line) in which case there is 
no need to index them.
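
A quick way to check how many records a file holds is to count its LOCUS
lines.  The following is only a minimal sketch using plain Python (no
BioPython needed); the function name count_records and the file name
example.gbk are made up for illustration.

```python
def count_records(lines):
    """Count GenBank records by counting lines that start with 'LOCUS'.

    'lines' can be an open file handle or any iterable of strings, so the
    file is read one line at a time and never held in memory as a whole.
    """
    count = 0
    for line in lines:
        if line.startswith("LOCUS"):
            count += 1
    return count


# Hypothetical usage (example.gbk is a placeholder file name):
# with open("example.gbk") as handle:
#     print(count_records(handle))
```

If this reports 1, the file holds a single record and an index would not
gain you anything.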
>  My final goal is to access the .gbk file somehow like a database.
Have you tried using the FeatureParser and then accessing the .features 
list property of the record?
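
As a rough sketch of what I mean, assuming BioPython is installed and the
file contains a single record: parse it once with the FeatureParser, then
group the entries of the .features list by their type so they can be looked
up a bit like rows in a table.  The helper names load_record and
index_features and the file name example.gbk are made up for illustration;
only FeatureParser and its parse method come from Bio.GenBank.

```python
from collections import defaultdict


def load_record(path):
    """Parse a single-record GenBank file with the FeatureParser.

    Assumption: BioPython is installed.  The import is kept local so that
    index_features below can be used without it.
    """
    from Bio import GenBank

    parser = GenBank.FeatureParser()
    with open(path) as handle:
        return parser.parse(handle)


def index_features(record):
    """Group the entries of record.features by their .type (e.g. 'CDS')."""
    by_type = defaultdict(list)
    for feature in record.features:
        by_type[feature.type].append(feature)
    return dict(by_type)


# Hypothetical usage (example.gbk is a placeholder file name):
# record = load_record("example.gbk")
# cds_features = index_features(record).get("CDS", [])
# print(len(cds_features))
```

This keeps the whole record in memory, so it only makes sense once the
parser itself copes with the file size.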
>  The alternative would be to use .fna,.faa and .fnn files 
>  and write my own methods. Or stuff all the data in a SQL-database.
>  But I still hope that Biopython could help.
>  
>  Before I spend more time on this, I'd like to ask you:
>  
>  With the Biopython tools, is it possible to handle
>  .gbk files of about 5MB in a reasonable time with 
>  a low- to middle-class desktop computer? If so, how?
Using the latest BioPython code it should be easy (see above).
Also, these two recent examples might be handy:
http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/genbank/
http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/genbank2fasta/
Peter
    
    