Bioperl: Any non-redundant database tools out there ???

Steve A. Chervitz sac@alberich.Stanford.EDU
Fri, 28 Aug 1998 04:07:02 -0700 (PDT)


Gordon,

I have some code that you may find useful. It came out of an experiment 
to test Jarkko Hietaniemi's String::Approx.pm for use with biosequences
(generally it works pretty well, but is a little buggy). It can also 
cluster all unique sequences from a set (use the -noneighb option). It reads 
Fasta-formatted sequences only.

http://genome-www.stanford.edu/perlOOP/bioperl/bin/cluster_seq.pl

This script requires some modules that are included with my Blast 
distribution, as well as String::Approx.pm from CPAN (if you want to do 
approximate matching).
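
For the exact-duplicate case Gordon asks about, the idea of keying on the
sequence can be sketched with a fixed-length digest in place of the full
sequence, which sidesteps the worry about very long hash keys. This is only
an illustration, written in Python rather than Perl, and it is not taken
from cluster_seq.pl. The names read_fasta, sequence_key and group_identical
are hypothetical, as is the choice of SHA-1 and of ignoring letter case.

```python
import hashlib
from collections import defaultdict


def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of Fasta lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Assumption: the identifier is the first word after '>'.
            words = line[1:].split()
            seq_id, chunks = (words[0] if words else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def sequence_key(sequence):
    """Return a 40-character hex digest of the sequence, ignoring case."""
    return hashlib.sha1(sequence.upper().encode("ascii")).hexdigest()


def group_identical(records):
    """Map each sequence digest to the list of IDs sharing that sequence."""
    groups = defaultdict(list)
    for seq_id, sequence in records:
        groups[sequence_key(sequence)].append(seq_id)
    return dict(groups)
```

Calling group_identical(read_fasta(open("seqs.fa"))), where seqs.fa is a
placeholder file name, returns one entry per distinct sequence. Every key
has the same length however long the sequence is, so the same scheme could
in principle sit on top of an on-disk hash. Two different sequences could
in theory share a digest, so a cautious version would also compare the full
sequences within each group.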

I'd be interested in any feedback you might have if you try it out.
 
Steve Chervitz
sac@genome.stanford.edu


On Thu, 27 Aug 1998, Ewan Birney wrote:

> 
> Gordon posted this to 'guts' but it seems much more
> appropriate to post to the main mailing list, hence I am
> forwarding it.
> 
> 
> 
> Ewan Birney
> <birney@sanger.ac.uk>
> http://www.sanger.ac.uk/Users/birney/
> 
> ---------- Forwarded message ----------
> Date: Thu, 27 Aug 1998 11:11:50 -0500
> From: Gordon D. Pusch <pusch@mcs.anl.gov>
> To: vsns-bcd-perl-guts@lists.uni-bielefeld.de
> Subject: Bioperl-guts: Any non-redundant database tools out there ???
> 
> Hi --- I am trying to construct a ``non-redundant'' version of WIT's
> sequence database. An obvious stupid-but-simple way to do this would
> be to use the sequence itself as the key to a hash of ID lists.
> 
> However, since there are a LOT of sequences, the whole thing obviously
> won't fit into memory and we will have to store the hash as a Berkeley-DB;
> and of course, some of the sequences are quite long.  I worry about such
> enormously long keys ``breaking'' something in either perl5 or Berkeley-DB's
> hash routines ---I gather they are stored internally as B-trees, so I
> could easily imagine very long keys producing stack-overflows during a
> tree traversal if the trees got too deep... :-(
> 
> Has anyone on this list implemented a non-redundant database-builder 
> in perl ???  
> 
> Does anyone know if there =IS= a limit as to how long a hash-key
> can be for either perl5 or Berkeley-DB ???  If so, what are the usual
> failure-modes ???
> 
> Can anyone suggest a more elegant algorithm than the ``stupid-but-simple'' 
> method outlined above ???
> 
> 
> Thanks in advance,
> 
> --  Gordon D. Pusch   <pusch@mcs.anl.gov>
> 
> Disclaimer:  I'm a consultant collaborating with Argonne researchers;
> I don't speak for ANL or the DOE --- and they *certainly* don't speak
> for =ME= !!!
> 
> Claimer:  I report =ALL= SPAMvertisers to their ISP --- =NO= exceptions !!!
> 
> =========== Bioperl Project Mailing List Message Footer =======
> Project URL: http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/
> For info about how to (un)subscribe, where messages are archived, etc:
> http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl-guts.html
> ====================================================================
> 
> =========== Bioperl Project Mailing List Message Footer =======
> Project URL: http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/
> For info about how to (un)subscribe, where messages are archived, etc:
> http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
> ====================================================================
> 
