[Bioperl-l] need help with large genbank file

Tue, 23 Jul 2002 19:00:22 -0400

Dinakar,

The file is too big for perl to open a filehandle on (at least that is 
what your error message states).

I know from painful experience :) that the file you are trying to read 
is larger than 2GB when it is uncompressed into its native form.  If 
your computer, filesystem, kernel or operating system cannot handle 
files larger than 2GB in size then you will get these sorts of errors.
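
A quick way to see whether perl itself is part of the problem is to ask 
how it was built. This is only a sketch: it assumes the core Config 
module's 'uselargefiles' and 'lseeksize' entries, and the helper name 
largefile_report is made up for illustration. It says nothing about the 
filesystem or kernel, which can impose the same limit on their own.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Config;    # core module describing how this perl was built

# Summarise this perl's large-file support (hypothetical helper name).
sub largefile_report {
    my $large = defined $Config{uselargefiles}
             && $Config{uselargefiles} eq 'define';
    my $lseek = $Config{lseeksize} || 0;    # size of a file offset, in bytes
    return sprintf "uselargefiles=%s lseeksize=%d",
        ($large ? 'yes' : 'no'), $lseek;
}

print largefile_report(), "\n";

# Offsets narrower than 8 bytes cannot address positions past 2GB.
print "This perl will probably fail to open files larger than 2GB\n"
    if ($Config{lseeksize} || 0) < 8;
```

If the report shows 'no' or an lseeksize of 4, rebuilding perl with 
large-file support is one option; working around the limit is the other.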

There are various tricks to make things work. Systems with 64-bit 
architectures (like Alphaservers) do not have these problems at all.
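
One trick of this kind, offered as a sketch rather than something tested 
on your system: never put the uncompressed file on disk at all, and read 
the decompressor's output through a pipe. A pipe has no file offsets, so 
the 2GB limit does not come into play. The code below assumes the 
compressed download is still on disk and is itself under the limit, that 
gzip is installed (gzip -dc also reads compress-style .Z archives), and 
that your perl supports the list form of pipe open (5.8 or later). The 
name count_fasta_records is made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Count FASTA records arriving on an already-open filehandle.
sub count_fasta_records {
    my ($fh) = @_;
    my $n = 0;
    while (my $line = <$fh>) {
        $n++ if $line =~ /^>/;    # every record starts with a '>' header
    }
    return $n;
}

# Usage: perl count_records.pl /path/to/compressed/file
my $compressed = shift @ARGV;
if (defined $compressed) {
    # Read gzip's output through a pipe; perl never opens the huge file.
    open(my $fh, '-|', 'gzip', '-dc', $compressed)
        or die "cannot start gzip: $!";
    print count_fasta_records($fh), " records\n";
    close($fh) or warn "gzip exited abnormally\n";
}
```

The same handle can be given to Bioperl with 
Bio::SeqIO->new(-fh => $fh, -format => 'fasta') in place of the -file 
argument, so next_seq() works as usual.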

Linux solved this in the kernel a long time ago, and the common Linux 
filesystems can all handle large files. You may, however, run into 
binary programs such as 'cat', 'more' and 'uncompress' that will 
coredump or segfault on large files because they were not compiled to 
support 64-bit offsets.

Without knowing your operating system or local configuration, I'd 
recommend that you experiment with breaking the nt file into several 
smaller pieces. You should be able to determine experimentally what 
file size limit you have.
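
Here is a rough sketch of the splitting step. It reads FASTA from a 
filehandle and starts a new piece only at a '>' header, so no record is 
cut in half. The name split_fasta, the output naming scheme and the 
1,000,000,000 byte default are all my own choices for illustration; pick 
whatever size your experiments show to be safe. Note that a piece can 
overshoot the threshold by up to one record, because the size is only 
checked when the next header arrives.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Copy FASTA records from $in into "$prefix.0001", "$prefix.0002", ...
# A new piece is started at a record boundary once the current piece
# has reached $max_bytes. Returns the names of the files written.
sub split_fasta {
    my ($in, $prefix, $max_bytes) = @_;
    my (@files, $out);
    my $written = 0;
    while (my $line = <$in>) {
        # A '>' header is the only safe place to switch output files.
        if ($line =~ /^>/ && (!defined $out || $written >= $max_bytes)) {
            if (defined $out) { close($out) or die "close failed: $!"; }
            my $name = sprintf "%s.%04d", $prefix, scalar(@files) + 1;
            open($out, '>', $name) or die "cannot write $name: $!";
            push @files, $name;
            $written = 0;
        }
        next unless defined $out;    # ignore anything before the first header
        print {$out} $line or die "write failed: $!";
        $written += length $line;
    }
    if (defined $out) { close($out) or die "close failed: $!"; }
    return @files;
}

# Usage: gzip -dc compressed_file | perl split_fasta.pl nt_part 1000000000
if (@ARGV) {
    my ($prefix, $max_bytes) = @ARGV;
    $max_bytes = 1_000_000_000 unless defined $max_bytes;
    my @files = split_fasta(\*STDIN, $prefix, $max_bytes);
    print scalar(@files), " piece(s) written\n";
}
```

Feeding it from a pipe, as in the usage comment, means the full-size file 
never has to exist on disk. Each piece can then be opened with 
Bio::SeqIO in the normal way.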

-Chris

Dinakar Desai wrote:
> Hello:
> 
> I am new to perl and bioperl. I have downloaded file from ncbi 
> (ftp://ftp.ncbi.nih.gov/blast/db/nt) and this file is quite large. I am 
> trying to parse this file for certain pattern with Bioperl. I get 
> error. I have looked into largefasta.pm and they suggest not to use it.
> I would appreciate, if you could help me with this problem.
> 
> My code to test only 5 records out of this big file is as follows:
> <code>
> #!/usr/bin/env perl
> 
> use lib '/home/desas2/perl_mod/lib/site_perl/5.6.0/';
> 
> use Bio::SeqIO;
> 
> $seqio = Bio::SeqIO->new( -file => "/home/desas2/data/nt", '-format' => 'Fasta');
> 
> $seqobj = $seqio->next_seq();
> $count = 5;
> while ($count > 0){
>         print $seqobj->seq();
>         $seqobj = $seqio->next_seq();
>         $count--;
> 
> }
> </code>
> and the error message is:
> <error>
> ------------ EXCEPTION  -------------
> MSG: Could not open /home/desas2/data/nt for reading: File too large
> STACK Bio::Root::IO::_initialize_io /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/Root/IO.pm:244
> STACK Bio::SeqIO::_initialize /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/SeqIO.pm:381
> STACK Bio::SeqIO::new /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/SeqIO.pm:314
> STACK Bio::SeqIO::new /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/SeqIO.pm:327
> STACK toplevel ./test_fasta.pl:8
> 
> --------------------------------------
> </error>
> 
> Do you have any suggestion, how I could get to read this big file and 
> get sequence object. I know how to manipulate sequence object.
> 
> Thank you.
> 
> Dinakar
> 

-- 
Chris Dagdigian, <dag@sonsorol.org>
Independent life science IT & research computing consulting
Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
Work: http://BioTeam.net PGP KeyID: 83D4310E  Yahoo IM: craffi