[Biojava-dev] Regexes and sequencing searching

Wed Apr 7 12:20:17 EDT 2004

On Wed, 7 Apr 2004, David Huen wrote:

> I have a need for software that will do a humongous amount of motif
> searching on a large number of long sequences.  I have tested a large
> number of combinations and the conclusions may be of interest.
>
> The trial case is 80 motifs searched on a smallish target of around > 1 Mb.
> The time includes all sequence reading/conversions/regex compilations/etc.
> Time is CPU secs.
>
> 1) Sequence wrapped in SymbolListCharSequence, then searched with
> java.util.regex.	17 secs.
> 2) Sequence converted to SimpleSymbolList, wrapped in
> SymbolListCharSequence, then searched with java.util.regex. 14 secs.
> 3) Sequence converted to String then searched with java.util.regex.  7 secs.

Very interesting -- might it be possible to split the timing for the
sequence creation and the searching?  Maybe there's some copying or list
resizing going on in there that we weren't aware of.

> I have also written my own SymbolList regex package (more than one!) that
> allows direct searching of sequence in any FiniteAlphabet.
> 4) Searching by recursive matching through a regex-like structure (an order
> of magnitude worse than anything else here!)
> 5) Compilation of RE -> NFA -> DFA -> ArrayStateMachine 25 secs.
> A change in the sequence implementation does not affect my symbollist regex
> package as it uses various tricks to reduce slow symbol access.
>
> Conclusions:
> 1) java.util.regex is bloody fast: do they have a native implementation?
> 2) the compressed sequence representations we use automatically for large
> sequences can affect some kinds of analyses that are dependent on a huge
> symbol bandwidth.  Even the fastest, most memory intensive representation
> is slower than String for individual "symbol" access.  Evidently there is a
> tradeoff between being able to have the whole sequence in RAM and swapping
> and comparative sizes of our packed representation and a String to consider
> in any final decision.

It sounds like you might have tried this, but what if you force the
sequence not to use the packed implementation?

>
> Also, do we really want a direct SymbolList regex package?
> Pros:
> Can search in any alphabet including those that cannot be handled with
> java.util.regex because of the absence of a suitable tokenization (large
> numbers of symbols).  Search alignments in protein cross-product space?
> Cons:
> yet another package to maintain.
> Slower than java.util.regex.
>
> Regards,
> David Huen
>
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at biojava.org
> http://biojava.org/mailman/listinfo/biojava-dev
>