[Bioperl-l] Limitations for Bio::DB::Fasta

Fields, Christopher J cjfields at illinois.edu
Sun Jul 13 15:22:46 UTC 2014


Which file is it?  That’s something we could probably check. My feeling is it’s one or more of:

1) Your version of perl doesn’t support large files (unlikely unless you are running a very old perl built without large-file support).  But I would expect that to fail outright rather than hang.

2) DB_File itself isn’t very efficient when you have a huge number of sequences (millions).  Is that the case here?

3) I/O is the bottleneck — in other words, you are running this on a non-optimal system where disk speed is a limiting factor.

Hard to say w/o testing it directly.  

There are alternatives worth noting (samtools faidx comes to mind).
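For context, the index these tools build is essentially a map from sequence ID to a byte offset in the FASTA file, so lookups can seek directly instead of scanning 700GB. A minimal sketch of that idea in Python (illustrative only — the function names here are my own, not BioPerl's or samtools' API, and real indexers also record line lengths so they can seek within a sequence):

```python
import io

def build_index(handle):
    """Map each sequence ID to the byte offset of its first sequence line."""
    index = {}
    offset = 0
    for line in handle:
        if line.startswith(">"):
            # ID is the first whitespace-delimited token after '>'
            seq_id = line[1:].split()[0]
            index[seq_id] = offset + len(line)
        offset += len(line)
    return index

def fetch(handle, index, seq_id):
    """Seek straight to a sequence and read it, without scanning the file."""
    handle.seek(index[seq_id])
    chunks = []
    for line in handle:
        if line.startswith(">"):  # start of the next record
            break
        chunks.append(line.strip())
    return "".join(chunks)

# Tiny in-memory example standing in for a file on disk
fasta = io.StringIO(">seq1 test\nACGT\nTTGG\n>seq2\nGGCC\n")
idx = build_index(fasta)
print(fetch(fasta, idx, "seq2"))  # -> GGCC
```

The point of the sketch: the index itself stays small (one entry per sequence), so if indexing stalls at 10GB the suspect is the on-disk DB layer or I/O, not the FASTA format per se.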

chris

On Jul 7, 2014, at 1:12 PM, Ki Baik <hkbaik at gmail.com> wrote:

> I'm trying to index a large fasta file that I downloaded from NCBI's ftp site. The size of the fasta file is 700GB. I'm trying to use Bio::DB::Fasta to index this file. When the index file hits around 10GB, it seems to hang. I'm wondering if there is a limit on the fasta file size it can index.
> 
> Also, how does Bio::DB::Fasta compare to Bio::Index::Fasta? Is one better for large fasta files? Are there any other indexing schemes I can use instead of these modules? Any information would be appreciated.
> 
> Thanks,
> 
> KB
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/bioperl-l

