[Biojava-dev] [Biojava-l] regex performance in Java

Mark Fortner phidias51 at gmail.com
Thu Oct 25 03:50:22 UTC 2012


Have you tried profiling the code to see where it's spending most of its
time?
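
Before reaching for a full profiler, a rough timing harness can already show whether pattern compilation or the matching itself dominates. The sketch below is only an illustration, not taken from the gists in this thread: the class name RegexTiming, the pattern, and the sample strings are all made up. It compares reusing one compiled Pattern against recompiling the regex on every call, which is a common source of avoidable overhead with java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical micro-benchmark sketch; names, pattern and inputs are assumptions.
final class RegexTiming {

    // Counts the inputs that contain a match, reusing one compiled Pattern
    // and one Matcher that is reset for each input.
    static int countPrecompiled(Pattern pattern, String[] inputs) {
        int hits = 0;
        Matcher matcher = pattern.matcher("");
        for (String input : inputs) {
            if (matcher.reset(input).find()) {
                hits++;
            }
        }
        return hits;
    }

    // Produces the same count, but compiles the regex again for every input.
    static int countRecompiling(String regex, String[] inputs) {
        int hits = 0;
        for (String input : inputs) {
            if (Pattern.compile(regex).matcher(input).find()) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String regex = "\\b(\\d+)\\s*mm\\b"; // made-up pattern for illustration
        String[] inputs = { "leaf 12 mm long", "petals absent", "stem 3mm wide" };
        Pattern pattern = Pattern.compile(regex);
        int rounds = 200_000;

        // Warm-up, so that the JIT compiler has a chance to optimise both paths
        // before anything is timed.
        int sink = 0;
        for (int i = 0; i < 20_000; i++) {
            sink += countPrecompiled(pattern, inputs);
            sink += countRecompiling(regex, inputs);
        }

        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += countPrecompiled(pattern, inputs);
        }
        long precompiledNanos = System.nanoTime() - start;

        start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += countRecompiling(regex, inputs);
        }
        long recompilingNanos = System.nanoTime() - start;

        System.out.printf("precompiled: %d ms%n", precompiledNanos / 1_000_000);
        System.out.printf("recompiling: %d ms%n", recompilingNanos / 1_000_000);
        System.out.println("total hits (keeps the loops live): " + sink);
    }
}
```

Numbers from a loop like this are only indicative, since JIT warm-up and garbage collection can skew them; a profiler or a proper benchmark harness would give a more reliable picture.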

Mark
On Oct 24, 2012 8:47 PM, "Hilmar Lapp" <hlapp at drycafe.net> wrote:

> The code is a very small snippet from a natural language processing
> tool aimed at extracting structured phenotype descriptions from un- or
> semistructured free text. Apparently the code as is (in Perl) makes a lot
> of regular expression matches, and so if the speed difference for them
> between Perl and Java is significant, in theory this might become a
> problem. Whether it will actually amount to a bottleneck remains to be
> seen, though, as the code is also doing other things that are
> potentially expensive, possibly more so than the regex matching.
>
> So the exercise here is merely to see whether there is a notable
> performance difference in regex pattern evaluation that can't simply be
> attributed to programming mistakes (and apparently there is).
>
>         -hilmar
>
> On Oct 24, 2012, at 2:30 PM, P. Troshin wrote:
>
> > Hi Hilmar,
> >
> > I looked at the test in a bit more detail. I can see what you are
> > trying to test, but is there a real-life problem behind this?
> > What this test is doing is a lot of searches on very short strings. Is
> > this what your real-life application does? I am asking because if your
> > real-life application uses regexps to look into long strings, the
> > performance might be totally different.
> > What is your aim? 3 seconds for 500K searches does not seem
> > particularly slow to me.
> >
> > Thanks
> > Peter
> >
> >
> > On 24 October 2012 19:10, P. Troshin <to.petr at gmail.com> wrote:
> >> Hi Hilmar,
> >>
> >> Hmm, it looks like I spoke too soon; the previous run was doing
> >> nothing as all of the cases were commented out.
> >> I can now see that the results of my runs are not massively different
> >> from yours.
> >> It would help if you could encourage your student to write a few unit
> >> tests so that we know what you are trying to achieve and to simplify
> >> the testing.
> >>
> >> Just a thought
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >> On 24 October 2012 17:47, Hilmar Lapp <hlapp at drycafe.net> wrote:
> >>> Hi everyone,
> >>>
> >>> Thanks for all your responses. Indeed I know that the Java regex API
> >>> isn't an enjoyable one to program with, and if the underlying task were
> >>> about writing something from scratch, I'd be all for avoiding regexes too
> >>> if the same thing could be achieved by string comparison.
> >>>
> >>> However, and of course I failed to say that initially, the task from
> >>> which this query is originating is about converting a Perl script to Java
> >>> (not because Perl is somehow bad, but because those Perl scripts have proven
> >>> to be an obstacle to easy cross-platform installation of the - mostly Java
> >>> - software they are a part of). That doesn't mean one couldn't in the
> >>> process also rewrite the code that uses regular expressions to one that
> >>> doesn't, but I also think it wise not to introduce multiple variables as a
> >>> source of error at once.
> >>>
> >>> Some of the responses would be best answered by looking at the
> >>> expressions and the code that uses them, so here are the two "benchmark"
> >>> scripts.
> >>>
> >>> Java: https://gist.github.com/3940931
> >>> Perl: https://gist.github.com/3940780
> >>>
> >>> I'm also copying Dongye Meng here, who is a CS student at UNC working
> >>> with us on the project - if anyone has further wisdom to share about how to
> >>> reduce the performance gap between the two versions, he'd surely appreciate it.
> >>>
> >>>        -hilmar
> >>>
> >>> On Oct 23, 2012, at 6:42 AM, Phillip Lord wrote:
> >>>
> >>>> Hilmar Lapp <hlapp at drycafe.net> writes:
> >>>>> They (at least as in java.util.regex) have been reported to me as
> >>>>> performing much slower (by several orders of magnitude) than the
> >>>>> regex implementation in Perl, and some simple benchmarking tests seem to
> >>>>> bear that out. Even after scrutinizing the benchmark and finding
> >>>>> nothing obvious, I'm still skeptical as to why this would be the case
> >>>>> - naively I would have assumed that the underlying runtime library is
> >>>>> implemented in C in both cases. But perhaps this is not true?
> >>>>
> >>>>
> >>>> Well, the difference is that Perl is Perl, while Java is not; it all
> >>>> depends on the JVM, and libraries also. A quick shuftie at
> >>>> the source for the OpenJDK libraries suggests that the regexp searching
> >>>> is done in Java -- it's not just a drop through to C. Always the problem
> >>>> with performance optimisation on Java -- you are only optimising for one
> >>>> situation. It might be interesting to see how much variation there is
> >>>> between JVMs.
> >>>>
> >>>> Like others, I would only use regexp as a last resort in Java anyway;
> >>>> compared to Perl, writing the code is painful. Still, I guess that you
> >>>> know this!
> >>>>
> >>>> Phil
> >>>
> >>> --
> >>> ===========================================================
> >>> : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
> >>> ===========================================================
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/biojava-l


