[BioPython] Large GenBank files: impossible to handle?
Peter
biopython at maubp.freeserve.co.uk
Tue Jan 3 13:57:01 EST 2006
Hans Meier wrote:
> Dear friends,
>
> I tried to handle a .gbk file of 4,7MB in size
> with a "700MHz, Pentium III, 256 MB RAM"-box.
This might work with the current release (BioPython 1.41), but it will
use a lot of memory. I would guess about 250MB, which is all your machine
has. This is a limitation of the old Martel-based parser.
You should be able to install the new GenBank parser (from CVS), which I
wrote specifically to cope with large GenBank files.
Ask if you need help with this. Also, are you on Windows or Linux?
See also bug 1747, http://bugzilla.open-bio.org/show_bug.cgi?id=1747
> Parsing with "RecordParser" and indexing with "index_file"
> crushed the machine in both cases, I had to reboot
> (what happens not so often with Debian).
I would avoid using index_file on large GenBank files - this still uses
Martel and can be rather slow. Also, I strongly suspect your files have
a single record each (i.e. only one LOCUS line) in which case there is
no need to index them.
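A quick way to check how many records a file holds is to count its LOCUS
lines. This is only a minimal sketch in plain Python, not part of
BioPython; the function name count_locus_lines and the file name in the
usage comment are made up for illustration, and it assumes every record
starts with the LOCUS keyword at the very beginning of a line:

```python
def count_locus_lines(lines):
    """Count GenBank records by counting lines that begin with 'LOCUS'.

    'lines' can be any iterable of strings, for example an open file
    handle, so the file is read one line at a time and never held in
    memory as a whole.
    """
    count = 0
    for line in lines:
        # Record headers start in column 0; continuation lines are indented.
        if line.startswith("LOCUS"):
            count += 1
    return count


# Hypothetical usage (the file name is a placeholder):
#
#     with open("your_genome.gbk") as handle:
#         print(count_locus_lines(handle))
```

If this reports 1, the file holds a single record and an index would gain
you nothing.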
> My final goal is to access the .gbk file somehow like a database.
Have you tried using the FeatureParser and then accessing the .features
list property of the record?
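As a rough sketch of what I mean, something like the following groups the
features by type so you can look them up a bit like database tables. The
helper names index_features_by_type and load_record are made up for this
example, and I am assuming that Bio.GenBank.FeatureParser().parse(handle)
returns a record whose .features entries each have a .type string (such
as "CDS" or "gene"); check this against your installed version:

```python
from collections import defaultdict


def index_features_by_type(record):
    """Group record.features into a dict keyed by each feature's .type."""
    by_type = defaultdict(list)
    for feature in record.features:
        by_type[feature.type].append(feature)
    return dict(by_type)


def load_record(path):
    """Parse a single-record GenBank file with the FeatureParser.

    Assumption: Bio.GenBank.FeatureParser is available in your install.
    The import sits inside the function so that the grouping helper
    above can be used without BioPython.
    """
    from Bio import GenBank

    parser = GenBank.FeatureParser()
    with open(path) as handle:
        return parser.parse(handle)


# Hypothetical usage (the file name is a placeholder):
#
#     record = load_record("your_genome.gbk")
#     features = index_features_by_type(record)
#     print(len(features.get("CDS", [])))
```

Once the record is parsed, lookups in the resulting dictionary are cheap,
so the slow step happens only once per file.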
> The alternative would be to use .fna,.faa and .fnn files
> and write my own methods. Or stuff all the data in a SQL-database.
> But I still hope that Biopython could help.
>
> Before I spend more time on this, I'd like to ask you:
>
> With the Biopython tools, is it possible to handle
> .gbk files of about 5MB in a reasonable time with
> a low- to middle-class desktop computer? If so, how?
Using the latest BioPython code it should be easy (see above).
Also, these two recent examples might be handy:
http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/genbank/
http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/genbank2fasta/
Peter