Details
- Bug
- Status: Resolved
- Major
- Resolution: Invalid
- 1.2.1, 1.3.0, 2.2.0
- None
- None
Description
Checking with matches() instead of find() would allow the matcher to exit early instead of looking for sub-sequences beyond the first non-match.
In UDFLike.java, the complex pattern checker uses matches() and the vectorized version uses find(0), which is more expensive.
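As a minimal sketch of the difference between the two calls (not the actual UDFLike code): the class name, helper names, and the pattern `abc.*xyz` below are hypothetical, chosen only for illustration. matches() requires the whole input to match, while find(0) searches for a matching sub-sequence starting at any offset, so the two only agree when the pattern is effectively anchored.

```java
import java.util.regex.Pattern;

class MatchVsFind {
    // Hypothetical unanchored pattern, used only for this illustration.
    static final Pattern P = Pattern.compile("abc.*xyz");

    // Whole-input check: the match must start at offset 0 and consume everything.
    static boolean viaMatches(String s) {
        return P.matcher(s).matches();
    }

    // Sub-sequence search: if no match starts at one offset, later offsets are tried.
    static boolean viaFind(String s) {
        return P.matcher(s).find(0);
    }

    public static void main(String[] args) {
        System.out.println(viaMatches("abc123xyz"));     // true
        System.out.println(viaFind("abc123xyz"));        // true
        System.out.println(viaMatches("__abc123xyz__")); // false
        System.out.println(viaFind("__abc123xyz__"));    // true
    }
}
```

On a miss, find(0) keeps retrying from later start offsets before giving up, which is the extra work the description refers to.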
Benchmark                             Mode  Cnt    Score    Error  Units
RegexBench.testGreedyRegexHit         avgt    5  379.316 ± 32.444  ns/op
RegexBench.testGreedyRegexHitCheck    avgt    5  344.895 ± 15.436  ns/op
RegexBench.testGreedyRegexMiss        avgt    5  497.193 ± 18.168  ns/op
RegexBench.testGreedyRegexMissCheck   avgt    5  171.872 ±  8.588  ns/op
On a miss, .find(0) is nearly 3x more expensive per row than the .matches() check version (497 vs 172 ns/op).
On a hit, the two are nearly the same (379 vs 345 ns/op).
With a lazy pattern, matches() is slower when there's a hit (because it runs the check to the end of the input), but more than 2x faster when there's a miss.
Benchmark                             Mode  Cnt    Score    Error  Units
RegexBench.testLazyRegexHit           avgt    5   78.398 ±  6.007  ns/op
RegexBench.testLazyRegexHitCheck      avgt    5  120.557 ±  4.396  ns/op
RegexBench.testLazyRegexMiss          avgt    5  387.594 ± 25.672  ns/op
RegexBench.testLazyRegexMissCheck     avgt    5  154.489 ± 13.622  ns/op
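The hit-side cost for lazy patterns can be illustrated with a small sketch. The patterns and input here are hypothetical and are not the ones used in RegexBench, which are not shown in this report. With a lazy quantifier, find(0) can stop at the first possible end of a match, whereas matches() has to keep extending the match until it reaches the end of the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedy {
    // End offset of the first find(0) hit, or -1 on a miss.
    static int findEnd(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find(0) ? m.end() : -1;
    }

    // End offset after a successful matches(), or -1 on a miss.
    static int matchesEnd(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String input = "a1b2b"; // hypothetical input
        System.out.println(findEnd("a.*?b", input));    // 3: lazy find stops at the first 'b'
        System.out.println(matchesEnd("a.*?b", input)); // 5: matches() must reach the end
        System.out.println(findEnd("a.*b", input));     // 5: greedy find runs to the last 'b'
    }
}
```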
Attachments
Issue Links
- relates to HIVE-14349 Vectorization: LIKE should anchor the regexes (Closed)