[Bioperl-l] Bio::Matrix::Substitution alpha version

Allen Smith easmith@beatrice.rutgers.edu
Fri, 27 Sep 2002 20:53:47 -0400


Hi. I have an alpha version of Bio::Matrix::Substitution and
Bio::Matrix::SubstitutionI ready for public inspection. It includes not only 
these modules (which do have POD documentation) but also a
t/MatrixSubstitution.t test file and a couple of data files for testing in
the t/data/ subdirectory.

Three things:
	A. What's the recommended way to submit these? SSH isn't readily
	   available to me locally, since IRIX has no /dev/random,
	   incidentally. I can make the files available over HTTP without
	   any problem, BTW.
	B. This is _just_ those two modules. Incorporation into the rest of
	   Bioperl (including SimpleAlign and maybe Bio::Tools::OddCodes) is 
	   a further project. Two major things will be needed for this:
		1. To efficiently match "these are the AAs/whatever that are 
		   closely related according to the matrix" to "these are
		   the AAs/whatever that we _have_", as in the
		   substitution/consensus groups in SimpleAlign and
		   Bio::Tools::OddCodes, the best method I've come up with
		   is conversion of the presence/absence of each AA/whatever 
		   into a bit in a bitstring, using vec, followed by bit
		   operations. This is considerably faster than using
		   regexes or Set::Scalar. It would be best by far
		   if this were also made into a new module (or set of
		   modules). Anyone have a good name?
		2. A proper means of associating matrices with alignments
		   _and with sequences_, and having this be easily
		   extensible to associate "this matrix is the best one to
		   use in these spots along this sequence, unless the other
		   sequence says to use something else" (as in, for
		   instance, being able to take a sequence for which the
		   structure is known and use different matrices between
		   alpha-helical regions, beta-sheet regions, etc., when
		   matching/aligning to a homologous sequence for which the
		   structure has not been determined). The existing
		   Bio::Range stuff doesn't appear to quite match up with
		   the requirements for this - one needs to be able to say
		   "this matrix should be used for positions
		   3-5,7-9,11-13..." (e.g., if there's a partially-buried
		   structure and one is using matrices that vary depending
		   on degree of solvent exposure) without getting into an
		   insane number of separate objects. Thoughts?
	C. I noticed one more minor bug in SimpleAlign, and have entered it
	   in the bug-tracker (with a patch that fixes it). The bug is
	   that SimpleAlign's consensus_iupac routine, when expanding
	   previous IUPAC symbols to the corresponding possible set of NAs,
	   wasn't always doing so when necessary: it did the one-to-two set
	   of expansions when a regexp matched [SKYWM], when it should have
	   been matching [SKYRWM]. R (A or G) was left out.
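
To make the bitstring idea in B.1 concrete, here is a minimal sketch. It is
in Python rather than Perl, and it uses plain integers as the bitstring
instead of vec. The alphabet order, the group names and the group contents
are all made up for illustration; none of them come from a real matrix or
from the modules described above.

```python
# Hypothetical sketch: one bit per amino acid, packed into an integer.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet and bit order
BIT = {aa: 1 << i for i, aa in enumerate(AMINO_ACIDS)}


def to_mask(residues):
    """Fold a string of residues into a single presence/absence bitmask."""
    mask = 0
    for aa in residues:
        mask |= BIT[aa]
    return mask


# Illustrative groups only; a real version would derive these from a matrix.
GROUPS = {
    "aliphatic": to_mask("ILV"),
    "aromatic": to_mask("FWY"),
    "charged": to_mask("DEKR"),
}


def groups_containing(column):
    """Return the names of groups that contain every residue in the column."""
    seen = to_mask(column)
    # A group covers the column when no observed bit falls outside its mask.
    return [name for name, mask in GROUPS.items() if seen & ~mask == 0]


print(groups_containing("IVV"))  # ['aliphatic']
print(groups_containing("IF"))   # []
```

Each comparison is a single AND plus a test against zero, whatever the
number of residues in the column, which is where the speed advantage over
regexes or set objects would come from.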

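The character-class problem in C can be checked against the standard IUPAC
two-base codes. The sketch below is Python, not the SimpleAlign code itself;
only the two character classes are taken from the description above, and the
table and variable names are my own.

```python
import re

# Standard IUPAC nucleotide codes that stand for exactly two bases.
TWO_BASE = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}

OLD_CLASS = re.compile(r"[SKYWM]")   # class described as buggy
NEW_CLASS = re.compile(r"[SKYRWM]")  # class described as the fix

# Two-base codes the old class fails to match but the new one does.
missed = [c for c in TWO_BASE
          if NEW_CLASS.fullmatch(c) and not OLD_CLASS.fullmatch(c)]

print(missed)  # ['R']
```
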
	-Allen

-- 
Allen Smith			http://cesario.rutgers.edu/easmith/
September 11, 2001		A Day That Shall Live In Infamy II
"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." - Benjamin Franklin