[Biojava-l] Behavior of the createRegex() method (MotifTool class)
Keith James
kdj@sanger.ac.uk
01 Dec 2002 18:42:00 +0000
>>>>> "Sylvain" == Sylvain Foisy <sylvain.foisy@bioneq.qc.ca> writes:
Sylvain> Hi, I used the createRegex() method to return a regular
Sylvain> expression from a sequence of DNA inputted by the user to
Sylvain> scan a genome for that motif. I just discovered an
Sylvain> interesting thing about that method: if n is in the motif
Sylvain> to seek, the regex will not have n as a possibility.
Sylvain> Ok, I have that motif: atgnnnngta.
Sylvain> CreateRegex would return: atg[atcg]{4}gta and it does
Sylvain> What if my sequence to scan contains n: atgagcngta, for
Sylvain> example. Java.util.regex would not find the
Sylvain> pattern. Unless mistaken, the pattern should be
Sylvain> atg[atcgn]{4}gta.
Sylvain> Am I wrong? Any input would be appreciated
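The observation above can be reproduced with java.util.regex alone. This is
only a sketch: the two patterns are typed in by hand exactly as they appear in
the post, not produced by createRegex(), and the class name RegexDemo and the
helper found() are made up for illustration.

```java
import java.util.regex.Pattern;

class RegexDemo {
    // Hypothetical helper: true if the regex matches anywhere in the target.
    static boolean found(String regex, String target) {
        return Pattern.compile(regex).matcher(target).find();
    }

    public static void main(String[] args) {
        String target = "atgagcngta"; // target sequence containing an n

        // Character class without n: the n in the target cannot be consumed.
        System.out.println(found("atg[atcg]{4}gta", target));

        // Character class with n added, as proposed in the post.
        System.out.println(found("atg[atcgn]{4}gta", target));
    }
}
```

The first pattern fails because n is not a member of [atcg]; the second
succeeds for this target, but it still would not match a target containing
any of the other ambiguity symbols.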
You are correct about the behaviour, but not about the solution. An
ambiguous target sequence could contain n, but could also contain r,
y, m, k, s, w, h, b, v and d. To match correctly, the regex would have
to take into account that the symbols represented by n are a superset
of those represented by each of the other ambiguity symbols.
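To make the superset point concrete, here is a minimal sketch for the DNA
case only. It does not use the BioJava API and is not how MotifTools works;
the class AmbiguousMotifRegex and its methods are hypothetical, and the
table is the standard IUPAC nucleotide code. The assumed rule is that a motif
symbol matches every target symbol whose set of bases is a subset of its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class AmbiguousMotifRegex {
    // IUPAC DNA codes mapped to the bases each one represents.
    private static final Map<Character, String> IUPAC = new LinkedHashMap<>();
    static {
        IUPAC.put('a', "a");   IUPAC.put('c', "c");
        IUPAC.put('g', "g");   IUPAC.put('t', "t");
        IUPAC.put('r', "ag");  IUPAC.put('y', "ct");
        IUPAC.put('m', "ac");  IUPAC.put('k', "gt");
        IUPAC.put('s', "cg");  IUPAC.put('w', "at");
        IUPAC.put('h', "act"); IUPAC.put('b', "cgt");
        IUPAC.put('v', "acg"); IUPAC.put('d', "agt");
        IUPAC.put('n', "acgt");
    }

    // True if every base in 'bases' also occurs in 'container'.
    static boolean isSubset(String bases, String container) {
        for (char c : bases.toCharArray()) {
            if (container.indexOf(c) < 0) return false;
        }
        return true;
    }

    // Character class of all target symbols covered by one motif symbol.
    static String classFor(char motifSymbol) {
        String motifBases = IUPAC.get(motifSymbol);
        if (motifBases == null) {
            throw new IllegalArgumentException("Unknown symbol: " + motifSymbol);
        }
        StringBuilder sb = new StringBuilder("[");
        for (Map.Entry<Character, String> e : IUPAC.entrySet()) {
            if (isSubset(e.getValue(), motifBases)) sb.append(e.getKey());
        }
        return sb.append(']').toString();
    }

    // One character class per motif position; runs are not collapsed to {n}.
    static String toAmbiguousRegex(String motif) {
        StringBuilder sb = new StringBuilder();
        for (char c : motif.toLowerCase().toCharArray()) {
            sb.append(classFor(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile(toAmbiguousRegex("atgnnnngta"));
        System.out.println(p.matcher("atgagcngta").find());
    }
}
```

Under this rule n expands to a class containing all fifteen symbols, while r
expands to [agr]. A looser rule, where a motif symbol matches any target
symbol sharing at least one base with it, is equally defensible; which one is
right depends on whether you want definite or possible matches.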
As MotifTools is generic (it will work for any alphabet), implementing
generation of regexes for searching ambiguous SymbolLists requires a
more complex algorithm than the current one. I'll take a look at this
as soon as I can.
Keith
--
- Keith James <kdj@sanger.ac.uk> bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -