[Biojava-l] Behavior of the createRegex() method (MotifTool class)
Keith James
kdj@sanger.ac.uk
02 Dec 2002 10:44:27 +0000
>>>>> "Matthew" == Matthew Pocock <matthew_pocock@yahoo.co.uk> writes:
Matthew> Well spotted Sylvain, Keith, there's a method in
Matthew> AlphabetTools - getAllSymbols(). Feed it with the
Matthew> matches() map of the symbol & cat together the tokens
Matthew> from each of these.
I don't think this method is behaving as expected. Passing it the
FiniteAlphabet returned by getMatches() for each of the following
Symbols gives these results:
    a -> getMatches() -> getAllSymbols() -> tokenize -> -a
    c -> getMatches() -> getAllSymbols() -> tokenize -> -c
    g -> getMatches() -> getAllSymbols() -> tokenize -> -g
    t -> getMatches() -> getAllSymbols() -> tokenize -> -t
    n -> getMatches() -> getAllSymbols() -> tokenize -> tnn-nannnngncnnn
The code I am using is below (for the Symbol at index i of a motif
SymbolList).

    Symbol sym = motif.symbolAt(i);
    FiniteAlphabet ambiAlpha = (FiniteAlphabet) sym.getMatches();
    Symbol[] ambiSyms = (Symbol[])
        AlphabetManager.getAllSymbols(ambiAlpha).toArray(new Symbol[0]);

    // getAllSymbols returns a Set (i.e. unordered), so we
    // convert to a char array in order to sort the tokens
    char[] ambiChars = new char[ambiSyms.length];
    for (int j = 0; j < ambiSyms.length; j++)
    {
        ambiChars[j] =
            sToke.tokenizeSymbol(ambiSyms[j]).charAt(0);
    }

    Arrays.sort(ambiChars);
    sb.append(ambiChars);

So the final character class for 'n' comes out as [-acgnnnnnnnnnnnt]
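
For what it's worth, the 16 tokens returned for 'n' split into one gap
('-'), four single bases and eleven 'n's. That would be consistent with
getAllSymbols() returning one Symbol for every subset of {a, c, g, t},
although I have not confirmed this against the source. The sketch below
is plain Java with no BioJava calls. It reproduces the sorting step on
that token string and then shows one possible workaround: keep each
token at most once, and only if it is in an assumed set of atomic
tokens. The class name, the method names and the "acgt" argument are
hypothetical and only there for illustration.

```java
import java.util.Arrays;
import java.util.TreeSet;

public class CharClassSketch {

    // Mirrors the sorting step above: every token character is kept,
    // including duplicates and the gap token.
    public static String sortedClass(String tokens) {
        char[] chars = tokens.toCharArray();
        Arrays.sort(chars);
        return "[" + new String(chars) + "]";
    }

    // Hypothetical workaround: keep a token only if it appears in
    // atomicTokens (an assumption supplied by the caller), once each,
    // in sorted order.
    public static String atomicClass(String tokens, String atomicTokens) {
        TreeSet<Character> kept = new TreeSet<Character>();
        for (char c : tokens.toCharArray()) {
            if (atomicTokens.indexOf(c) >= 0) {
                kept.add(c);
            }
        }
        StringBuilder sb = new StringBuilder("[");
        for (char c : kept) {
            sb.append(c);
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        // Token string reported above for 'n'
        String tokens = "tnn-nannnngncnnn";
        System.out.println(sortedClass(tokens));
        System.out.println(atomicClass(tokens, "acgt"));
    }
}
```

This only post-processes the tokens. Asking the matches alphabet for
its atomic Symbols directly, if the API allows it, would avoid needing
the filter at all.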
--
- Keith James <kdj@sanger.ac.uk> bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -