[Bioperl-l] need help with large genbank file

Dinakar Desai Desai.Dinakar@mayo.edu
Tue, 23 Jul 2002 18:10:49 -0500


Chris Dagdigian wrote:
> 
> Dinakar,
> 
> The file is too big for perl to open a filehandle on (at least that is 
> what your error message states)
> 
> I know from painful experience :) that the file you are trying to read 
> is larger than 2GB when it is uncompressed into its native form.  If 
> your computer, filesystem, kernel or operating system cannot handle 
> files larger than 2GB in size then you will get these sorts of errors.
> 
> There are various tricks to make things work. Systems with 64-bit 
> architectures (like Alphaservers) do not have these problems at all.
> 
> Linux solved this in the kernel a long time ago and the common linux 
> filesystems can all handle large files. There are however binary 
> programs that you may run into like 'cat', 'more', 'uncompress' etc. 
> etc. that will coredump or segfault on large files because they were not 
> compiled to support 64-bit offsets.
> 
> Without knowing your operating system or local configuration I'd 
> recommend that you experiment with breaking NT into several smaller 
> pieces. You should be able to determine experimentally the filesize 
> limit that you appear to have.
> 
> -Chris
> 
> 
> 
> 
> Dinakar Desai wrote:
> 
>> Hello:
>>
>> I am new to perl and Bioperl. I have downloaded a file from NCBI 
>> (ftp://ftp.ncbi.nih.gov/blast/db/nt), and this file is quite large. I 
>> am trying to parse this file for a certain pattern with Bioperl, but I 
>> get an error. I have looked into largefasta.pm, and its documentation 
>> suggests not using it.
>> I would appreciate it if you could help me with this problem.
>>
>> My code to test only 5 records out of this big file is as follows:
>> <code>
>> #!/usr/bin/env perl
>>
>> use strict;
>> use warnings;
>>
>> use lib '/home/desas2/perl_mod/lib/site_perl/5.6.0/';
>> use Bio::SeqIO;
>>
>> my $seqio = Bio::SeqIO->new( -file   => "/home/desas2/data/nt",
>>                              -format => 'Fasta' );
>>
>> # print the first 5 sequences, stopping early if the stream ends
>> my $count = 5;
>> while ( $count-- > 0 and defined( my $seqobj = $seqio->next_seq() ) ) {
>>         print $seqobj->seq();
>> }
>> </code>
>> and the error message is:
>> <error>
>> ------------ EXCEPTION  -------------
>> MSG: Could not open /home/desas2/data/nt for reading: File too large
>> STACK Bio::Root::IO::_initialize_io /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/Root/IO.pm:244
>> STACK Bio::SeqIO::_initialize /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/SeqIO.pm:381
>> STACK Bio::SeqIO::new /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/SeqIO.pm:314
>> STACK Bio::SeqIO::new /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/SeqIO.pm:327
>> STACK toplevel ./test_fasta.pl:8
>> --------------------------------------
>> </error>
>>
>> Do you have any suggestions on how I could read this big file and 
>> get sequence objects? I know how to manipulate sequence objects.
>>
>> Thank you.
>>
>> Dinakar
>>
> 
> 
> 

Thank you very much for your email. I am running this script on:
Linux 2.4.7-10 #1 Thu Sep 6 16:46:36 EDT 2001 i686 unknown
The machine has about 2.5 GB of memory.
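On a Linux 2.4 kernel the filesystem itself usually handles large files, so (as Chris noted) a likely culprit is a perl binary built without large-file support. One quick way to check is the standard Config module, which exposes perl's build-time options; a value of 'define' for uselargefiles means support was compiled in. A minimal sketch:

```perl
#!/usr/bin/env perl
# Report whether this perl binary was built with large-file (>2 GB) support.
use strict;
use warnings;
use Config;    # standard module exposing perl's build-time configuration

# 'define' means large-file support was compiled in; anything else means not.
print "uselargefiles = ", ( $Config{uselargefiles} || 'undef' ), "\n";
```

If it prints 'undef', rebuilding perl with large-file support enabled should make open() work on the 6.2 GB nt file directly.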

I used Biopython and I could open the file and do some work. I thought I 
would try Bioperl (which seems to be more mature) and ran into this problem.

The size of file is: 6298460844 bytes (6.2 GB)

Can you suggest how I can break this file into smaller files and then 
parse them?
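One approach is to stream the FASTA file line by line and start a new output file every N records, since each record begins with a '>' line. The sketch below is untested against the real 6.2 GB nt file; the split_fasta name, the nt.0001-style piece names, and the 100,000-record chunk size are my own choices, not anything from Bioperl. One caveat: the perl running this script must itself be able to open the big file, or it will hit the same "File too large" error.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# split_fasta: stream $infile record by record, writing every $chunk_size
# records to a new file named "$infile.NNNN". Returns (records, pieces).
sub split_fasta {
    my ( $infile, $chunk_size ) = @_;
    open my $in, '<', $infile or die "cannot open $infile: $!\n";
    my ( $records, $piece, $out ) = ( 0, 0, undef );
    while ( my $line = <$in> ) {
        if ( $line =~ /^>/ ) {    # a '>' line starts a new FASTA record
            if ( $records % $chunk_size == 0 ) {
                close $out if $out;
                my $name = sprintf "%s.%04d", $infile, ++$piece;
                open $out, '>', $name or die "cannot open $name: $!\n";
            }
            $records++;
        }
        print {$out} $line if $out;
    }
    close $out if $out;
    return ( $records, $piece );
}

# Example: split the file named on the command line into 100,000-record pieces.
if (@ARGV) {
    my ( $records, $pieces ) = split_fasta( $ARGV[0], 100_000 );
    print "wrote $records records into $pieces file(s)\n";
}
```

Each resulting piece (nt.0001, nt.0002, ...) is itself valid FASTA and can be fed to Bio::SeqIO exactly as in your script.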



Thank you.

Dinakar

-- 

Dinakar Desai, Ph.D
perl -e '$_ = "mqonx.zako\@ude";$_=~ tr /qnxzk\@.ue/npqmy.\@eu/; print'
----------------------

Everything should be made as simple as possible, but no 
simpler. -- Albert Einstein