[Bioperl-l] Limit on sequence file size fetches?

Robert Bradbury robert.bradbury at gmail.com
Sun Aug 16 19:16:09 UTC 2009


Hello,

I am trying to use get_sequence() to fetch the sequence NS_000198 for the
fungus *Podospora anserina*, first from the "GenBank" database and, when that
didn't work, from "Gene".  The script is simple: it fetches the sequence and
then writes out FASTA and GenBank files from the resulting data structure.
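
In case it is relevant, here is a minimal sketch of what the script does
(simplified; the output file names are just placeholders):

  #!/usr/bin/perl -w
  use strict;
  use Bio::Perl;

  # Fetch the record from GenBank by accession.
  my $seq = get_sequence('genbank', 'NS_000198');

  # Write the same record back out in FASTA and GenBank format.
  write_sequence(">NS_000198.fasta", 'fasta',   $seq);
  write_sequence(">NS_000198.gb",    'genbank', $seq);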

The errors I got suggested that the system was running out of memory, which I
thought was unlikely since I have something like 3GB of main memory and
9GB of swap space.  After running strace on the script (which takes a while)
I determined that the brk() calls were returning ENOMEM at ~3GB.  This
turns out to be due to the 3GB/1GB user/kernel split of the 32-bit Linux
memory model I am using on a Pentium IV (Prescott).

Now, I think the total genome size for the fungus is ~70MB, though I haven't
verified this, so I "should" be able to fetch it, unless Bioperl (or perl
itself) is doing extremely poor memory management as the reads take place
(perhaps not coalescing memory segments into one large sequence) [1].

Has anyone encountered this problem (fetching, say, large mammalian
chromosomes)?  Does anyone know what the limits are for "fetching" sequence
files (on 32/64-bit machines)?  The reason I am using get_sequence() and
BioPerl is that I can't seem to find the *Podospora anserina* sequence on an
FTP site anywhere (so I can't simply use wget or ftp).  I haven't tried
accessing the GenBank record in a browser (I don't know what a browser would
do with an HTML page that large, but I suspect it would not be pretty).
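
If the limits really are this low, one workaround I may try is to skip the
in-memory sequence object entirely and stream the record straight to disk
through NCBI's EFetch interface; a rough sketch (the db/rettype parameters
below are my guess at what this record needs):

  use strict;
  use LWP::Simple qw(getstore is_success);

  # Stream the GenBank flat file directly to a local file instead of
  # building a Bio::Seq object in memory.
  my $acc = 'NS_000198';
  my $url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
          . "?db=nuccore&id=$acc&rettype=gb&retmode=text";
  my $rc  = getstore($url, "$acc.gb");
  die "Download of $acc failed (HTTP $rc)\n" unless is_success($rc);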

Thanks in advance,
Robert Bradbury

1. The strace output shows periodic brk() calls that expand the process
data segment, interleaved with many read() calls of size 4096, presumably
reading the socket from NCBI.  I don't know if there is an easy way to trace
perl's memory allocation/manipulation at a higher level.
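
One thing I may try is Devel::Size, which should at least report how large
the resulting objects end up being (a rough sketch; I don't know how well it
follows Bioperl's internal structures):

  use strict;
  use Bio::Perl;
  use Devel::Size qw(total_size);

  # Report how much memory the fetched sequence object occupies,
  # including everything it references.
  my $seq = get_sequence('genbank', 'NS_000198');
  printf "Bio::Seq object: %.1f MB\n", total_size($seq) / (1024 * 1024);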


