[Bioperl-l] indexing conservation scores

Thu Dec 23 00:39:12 UTC 2010

Maybe use a tied hash backed by BerkeleyDB or AnyDBM_File, or DBD::SQLite?  You could also convert to BigWig and use Lincoln's Bio::DB::BigFile tools (note that the installation process is a little tricky):

http://search.cpan.org/~lds/Bio-BigFile-1.04/lib/Bio/DB/BigWig.pm
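
To make the SQLite idea a bit more concrete, here is a rough sketch. It is written in Python with the standard sqlite3 module rather than Perl/DBD::SQLite, purely for illustration; the table layout and the function names (load_fixedstep, fetch_scores) are made up for this example and are not part of any BioPerl module. It assumes the input contains only fixedStep blocks like the one quoted below:

```python
import sqlite3


def load_fixedstep(conn, lines):
    """Parse fixedStep wiggle lines and store one row per position.

    Assumption: only fixedStep blocks are present (no variableStep data).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scores ("
        "chrom TEXT, pos INTEGER, score REAL, PRIMARY KEY (chrom, pos))"
    )
    chrom, pos, step = None, 0, 1
    rows = []
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments and track definition lines.
        if not line or line.startswith("#") or line.startswith("track"):
            continue
        if line.startswith("fixedStep"):
            # Header fields are key=value pairs,
            # e.g. "fixedStep chrom=chrYHet start=1 step=1".
            fields = dict(f.split("=", 1) for f in line.split()[1:])
            chrom = fields["chrom"]
            pos = int(fields["start"])
            step = int(fields.get("step", 1))
            continue
        rows.append((chrom, pos, float(line)))
        pos += step
    # For genome-sized files you would insert in batches rather than
    # collecting every row in memory first.
    conn.executemany("INSERT OR REPLACE INTO scores VALUES (?, ?, ?)", rows)
    conn.commit()


def fetch_scores(conn, chrom, start, end):
    """Return the scores for positions start..end (inclusive), in order."""
    cur = conn.execute(
        "SELECT score FROM scores "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    )
    return [row[0] for row in cur]


if __name__ == "__main__":
    sample = [
        "fixedStep chrom=chrYHet start=1 step=1",
        "0.117",
        "0.092",
        "0.092",
        "0.085",
    ]
    conn = sqlite3.connect(":memory:")  # use a file path for a persistent index
    load_fixedstep(conn, sample)
    print(fetch_scores(conn, "chrYHet", 2, 4))
```

The primary key on (chrom, pos) is what makes a range lookup cheap: the database can seek straight to the first position instead of scanning the file from the top, which is what the sed approach has to do. Whether this beats BigWig for your data is something you would have to measure.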

Also, +1 to Sean's suggestion (don't rely completely on BioPerl to implement everything :)

chris

On Dec 22, 2010, at 6:00 PM, Maxim wrote:

> Hi,
> 
> Bio::DB::Fasta is a beautiful tool for fast access to sequences stored in
> large flat text (FASTA) files, and I really love it. Now I'd like to speed up
> the retrieval of data from large files that store conservation scores. The
> files that I was able to find at UCSC are in fixedStep wiggle format, like:
> 
> fixedStep chrom=chrYHet start=1 step=1
> 0.117
> 0.092
> 0.092
> 0.085
> 0.071
> 0.051
> 0.021
> 0.010
> 0.008
> 0.010
> 0.019
> 0.023
> 0.023
> 0.019
> ........
> 
> Does anyone see a way to use the indexing mechanism of
> Bio::DB::Fasta to allow retrieval of floating-point numbers? I could
> reformat the wiggle file to a simple space-, tab-, or comma-separated list of
> scores per chromosome.
> 
> Are there any suggestions? Or is there already a module that takes care of my
> problem that I have simply overlooked?
> Or would such an approach not be considerably faster than ordinary Unix commands
> like:
> sed -n '2,5001p' chrYHet.pp
> to retrieve the scores?
> 
> 
> Maxim
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l