[Bioperl-l] dealing with large files

Stefano Ghignone ste.ghi at libero.it
Thu Dec 20 13:57:54 UTC 2007


I was wondering whether, when working with such big files, it would be better to first index the database and then query it, formatting the sequences as one wants...
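
As a rough illustration of that index-then-fetch idea (a minimal sketch: the
index file name, input files and accession are made up, and the EMBL flat
files would need to stay uncompressed on disk for the byte-offset index to
work):

    use Bio::Index::EMBL;

    # Build the index once; the .idx file name is arbitrary.
    my $inx = Bio::Index::EMBL->new(-filename   => 'embl_files.idx',
                                    -write_flag => 1);
    $inx->make_index('big_file1.embl', 'big_file2.embl');

    # Later: pull single entries out without re-parsing the whole file.
    my $seq = $inx->fetch('AB123456');    # hypothetical accession
    print $seq->display_id, "\t", $seq->length, "\n" if $seq;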

> It gets buffered via the OS -- Bio::Root::IO calls next_line  
> iteratively, but eventually the whole sequence object will get put  
> into RAM as it is built up.
> zcat or bzcat can also be used for gzipped and bzipped files  
> respectively; I like to use this where I want to keep the disk space  
> footprint down.
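
A minimal sketch of that streaming pattern with a zcat pipe (the file name
is just an example); only the current record is held in memory at any time:

    use Bio::SeqIO;

    my $infile = 'big_file.embl.gz';               # example path
    my $in = Bio::SeqIO->new(-file   => "zcat $infile |",
                             -format => 'EMBL');
    while ( my $seq = $in->next_seq ) {
        # each Bio::Seq object is built, used, then discarded
        print $seq->accession_number, "\t", $seq->length, "\n";
    }
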
> 
> Because we usually treat data input as a stream, ignoring whether it  
> comes from a file or not, we would need a more flexible structure to  
> really handle this, although I'd argue the data really belongs in a  
> database when it is too big for memory.
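
For the sequence data itself, one database-like option is Bio::DB::Fasta,
which builds an on-disk index of a FASTA file and hands back just the
slices you ask for (a minimal sketch; file and sequence names are made up):

    use Bio::DB::Fasta;

    # Indexes genome.fa on first use and reuses the index afterwards.
    my $db = Bio::DB::Fasta->new('genome.fa');

    # Fetch only the region needed; the whole chromosome is never
    # loaded into RAM at once.
    my $slice = $db->seq('chr1', 1_000_000 => 1_001_000);
    print length($slice), "\n";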
> More compact Feature/Location objects would probably also help here.   
> I would not be surprised if the memory requirement has more to do  
> with the number of features than the length of the sequence - human  
> chromosome 1 can fit into memory just fine on most machines with 2GB of RAM.
> 
> But it would require someone taking an interest in some  
> re-architecting here.
> 
> -jason
> 
> On Dec 19, 2007, at 9:59 PM, Michael Thon wrote:
> 
> >
> > On Dec 18, 2007, at 7:04 PM, Stefano Ghignone wrote:
> >
> >> my $in = Bio::SeqIO->new(-file   => "/bin/gunzip -c $infile |",
> >>                          -format => 'EMBL');
> >
> > This is just for the sake of curiosity, since you already found a  
> > solution to your problem, but I wonder how Perl will handle a file  
> > opened this way.  Will it try to suck the whole thing into RAM in  
> > one go?
> >
> > Mike
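
At the plain-Perl level, a piped -file argument is read lazily: Perl forks
the command and consumes its output as it is produced, so nothing is
slurped up front (what ends up in RAM is the parsed sequence object, as
noted above). Roughly, with an example file name:

    # approximately what -file => "/bin/gunzip -c $infile |" amounts to
    open( my $fh, '-|', '/bin/gunzip -c big_file.embl.gz' )
        or die "cannot start gunzip: $!";
    while ( my $line = <$fh> ) {
        # lines arrive as gunzip writes them, one buffer at a time
    }
    close $fh;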
> 
> 




