[Bioperl-l] Bio::DB::Fasta fails for files over 4GB

Chris Fields cjfields at uiuc.edu
Mon Aug 7 17:43:01 UTC 2006


Dynamically determining the packing based on file size is probably  
the way to go; it would be nice to see how this affects speed.
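
A rough way to get a first number on the speed question is to time a pack/unpack round trip with each template, using the core Benchmark module. This is only a sketch: the record values below are made-up placeholders, roundtrip() is a hypothetical helper, and it measures the packing cost alone, not real index lookups. 'Q' also assumes a perl built with 64-bit integer support.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Placeholder record: only the first two fields (file offset, sequence
# length) are described in this thread; the rest are arbitrary values.
my @record = (123_456_789, 50_000, 60, 61, 1, 'genome.fa');

# Pack the record with the given template and unpack it again.
# Returns the number of fields recovered.
sub roundtrip {
    my ($template) = @_;
    my @fields = unpack $template, pack($template, @record);
    return scalar @fields;
}

# Time both templates over the same number of iterations;
# 'none' suppresses the per-test output so only the table is printed.
my $results = timethese(200_000, {
    'N offset' => sub { roundtrip('NNnnCa*') },
    'Q offset' => sub { roundtrip('QNnnCa*') },
}, 'none');

# Print a comparison table of the two rates.
cmpthese($results);
```

Any difference seen here would still need checking against a real index, since tied-hash access is likely to dominate.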

Chris

On Aug 7, 2006, at 11:05 AM, Charles Tilford wrote:

> I just found out that Bio::DB::Fasta has an inherent 4GB file size
> limit. This is due to how indexing information is stored. The module
> pack()s information using this format:
>
> use constant STRUCT =>'NNnnCa*';
>
> ... where the first token is the file offset. N = 32-bit unsigned
> integer, which rolls over when the file position passes the 4GB mark,
> resulting in garbage out for those entries. Changing the packing
> format to:
>
> use constant STRUCT =>'QNnnCa*';
>
> ... solves the problem (Q = 64-bit unsigned int). We have several
> genomic files (Ensembl dumps) where this is an issue:
>
> -rw-rw-r--  1 kirovs   bioinfo 7.2G Jul 13 12:28 pan_troglodytes.genome.CHIMP1A.fa
> -rw-rw-r--  1 kirovs   bioinfo 6.8G Jul 13 12:25 monodelphis_domestica.genome.BROADO3.fa
> -rw-rw-r--  1 kirovs   bioinfo 5.0G Jul 13 12:26 mus_musculus.genome.NCBIM36.fa
> -rw-rw-r--  1 kirovs   bioinfo 4.6G Aug  2 15:31 bos_taurus.genome.Btau2.fa
> -rw-rw-r--  1 kirovs   bioinfo 4.1G Jul 13 12:22 danio_rerio.genome.ZFISH6.fa
>
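
To make the rollover concrete, here is a small sketch that pushes an offset through a 4-byte 'N' field and an 8-byte 'Q' field. The helper names are hypothetical, and the mask only makes explicit that the low 32 bits are all an 'N' field can hold. It assumes a perl with 64-bit integer support, which 'Q' requires anyway. Note also that 'Q' uses native byte order while 'N' is always big-endian, so an index written with 'Q' may not be portable between machines.

```perl
use strict;
use warnings;

# Round-trip an offset through a 4-byte big-endian field ('N').
# Only the low 32 bits fit, which the mask makes explicit.
sub roundtrip_n {
    my ($offset) = @_;
    my ($got) = unpack 'N', pack('N', $offset & 0xFFFFFFFF);
    return $got;
}

# Same round trip through an 8-byte field ('Q', native byte order).
sub roundtrip_q {
    my ($offset) = @_;
    my ($got) = unpack 'Q', pack('Q', $offset);
    return $got;
}

my $four_gb = 4 * 1024**3;    # 4294967296 bytes

# One offset just under the boundary, one just past it.
for my $offset ($four_gb - 1, $four_gb + 100) {
    printf "offset %d: N gives %d, Q gives %d\n",
        $offset, roundtrip_n($offset), roundtrip_q($offset);
}
```

The offset just under 4GB survives both templates; the one 100 bytes past it comes back as 100 through 'N' and intact through 'Q'.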
> These are not really large genomes, but they have a fair number of
> unassembled (duplicate) fragments in them, which bump up the file
> size. Some fully assembled genomes will probably top the 4GB mark
> eventually anyway.
>
> Unfortunately, this raises a backward compatibility issue, since an
> index packed with 'N' will fail when unpacked with 'Q'. Perhaps the
> module could dynamically bifurcate the packing structure based on a
> file size test?
>
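
A minimal sketch of what such a file size test might look like is below. The constant and sub names are hypothetical and not part of Bio::DB::Fasta. It only picks a template; a real change would also need the index to record which template it was written with, so that existing 'N' indexes keep working.

```perl
use strict;
use warnings;

# Hypothetical names for illustration only.
use constant STRUCT    => 'NNnnCa*';    # existing layout, 32-bit offset
use constant STRUCTBIG => 'QNnnCa*';    # proposed layout, 64-bit offset
use constant MAX_32BIT => 0xFFFFFFFF;   # largest value an 'N' field holds

# Choose a template from a size in bytes.
sub struct_for_size {
    my ($bytes) = @_;
    return $bytes > MAX_32BIT ? STRUCTBIG : STRUCT;
}

# Choose a template for a file on disk, using stat() for the size.
sub struct_for_file {
    my ($path) = @_;
    my $size = (stat $path)[7];
    die "Cannot stat $path: $!\n" unless defined $size;
    return struct_for_size($size);
}

# Small files keep the old layout, files past 4GB get the wide offset.
printf "%-6s %s\n", '1GB:', struct_for_size(1 * 1024**3);
printf "%-6s %s\n", '5GB:', struct_for_size(5 * 1024**3);
```

Keeping 'N' for files under 4GB means indexes for those files stay the same size and stay readable by older code.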
> The second token is for the sequence length. I can't imagine a single
> sequence exceeding 4Gb, so it's probably safe - yes? Or should it also
> be Q, in the event that biology someday exceeds our current imagination?
>
> Thanks,
> CAT
>
> -- 
> Charles Tilford, Bioinformatics-Applied Genomics
> Bristol-Myers Squibb PRI, Hopewell 3A039
> P.O. Box 5400, Princeton, NJ 08543-5400, (609) 818-3213
> charles.tilford at bms.com
>

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign





