  1. Hive
  2. HIVE-14318

Vectorization: LIKE should use matches() instead of find(0)

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 1.2.1, 1.3.0, 2.2.0
    • Fix Version/s: None
    • Component/s: Vectorization
    • Labels: None

    Description

      Using matches() instead of find() would let the matcher exit early on the first non-match, instead of continuing to look for matching sub-sequences beyond it.

      In UDFLike.java, the complex pattern checker uses matches() and the vectorized version uses find(0), which is more expensive.
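
      To make the difference between the two calls concrete, here is a minimal sketch using only java.util.regex. The class name LikeMatchDemo, the helper names, and the regex translations of the LIKE patterns are assumptions for illustration; they are not taken from UDFLike.java.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Hive.
final class LikeMatchDemo {
    // True only if the entire input matches the pattern.
    static boolean matchesWhole(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.matches();
    }

    // Resets the matcher and scans from index 0 for any matching subsequence.
    static boolean findAnywhere(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find(0);
    }

    public static void main(String[] args) {
        // Assumed regex translation of LIKE '%abc%def%'.
        Pattern wrapped = Pattern.compile(".*abc.*def.*", Pattern.DOTALL);
        // Assumed regex translation of LIKE 'abc%' (no leading wildcard).
        Pattern prefix = Pattern.compile("abc.*", Pattern.DOTALL);

        // Wrapped in wildcards: both calls agree on hit and miss.
        System.out.println(matchesWhole(wrapped, "xxabcyydefzz")); // true
        System.out.println(findAnywhere(wrapped, "xxabcyydefzz")); // true
        System.out.println(matchesWhole(wrapped, "xxabcyy"));      // false
        System.out.println(findAnywhere(wrapped, "xxabcyy"));      // false

        // Without a leading wildcard the two calls can disagree:
        // find(0) accepts a match that starts in the middle of the input.
        System.out.println(matchesWhole(prefix, "xxabcyy"));       // false
        System.out.println(findAnywhere(prefix, "xxabcyy"));       // true
    }
}
```

      The two calls are interchangeable only when the pattern itself already covers the whole input, so any swap would need to be checked against how the regex is actually built.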

      Benchmark                            Mode  Cnt    Score    Error  Units
      RegexBench.testGreedyRegexHit        avgt    5  379.316 ± 32.444  ns/op
      RegexBench.testGreedyRegexHitCheck   avgt    5  344.895 ± 15.436  ns/op
      RegexBench.testGreedyRegexMiss       avgt    5  497.193 ± 18.168  ns/op
      RegexBench.testGreedyRegexMissCheck  avgt    5  171.872 ±  8.588  ns/op
      

      On a miss, .find(0) is nearly 3x more expensive per row than the .matches() check version.

      On a hit, the two cost nearly the same.

      With a lazy pattern, .matches() is slower when there's a hit (because it runs the check to the end of the input), but about 2.5x faster when there's a miss.

      Benchmark                            Mode  Cnt    Score    Error  Units
      RegexBench.testLazyRegexHit          avgt    5   78.398 ±  6.007  ns/op
      RegexBench.testLazyRegexHitCheck     avgt    5  120.557 ±  4.396  ns/op
      RegexBench.testLazyRegexMiss         avgt    5  387.594 ± 25.672  ns/op
      RegexBench.testLazyRegexMissCheck    avgt    5  154.489 ± 13.622  ns/op
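
      The lazy-hit result above can be illustrated without timing anything by looking at where each call stops. This is a sketch under assumptions: the class LazyEndDemo, its helpers, and the sample pattern and input are hypothetical and are not the RegexBench code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not the RegexBench benchmark.
final class LazyEndDemo {
    // End offset of the match found by find(0), or -1 if nothing is found.
    static int endAfterFind(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find(0) ? m.end() : -1;
    }

    // End offset after a successful matches(), or -1 if the input does not match.
    static int endAfterMatches(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.matches() ? m.end() : -1;
    }

    public static void main(String[] args) {
        // Assumed lazy (reluctant) pattern in the spirit of the benchmark.
        Pattern lazy = Pattern.compile(".*?abc.*?def.*?", Pattern.DOTALL);
        String hit = "xxabcyydefzzzz"; // 14 characters, "def" ends at offset 10

        // find(0) can stop right after "def": the trailing .*? matches nothing.
        System.out.println(endAfterFind(lazy, hit));    // 10
        // matches() must account for the whole input, so it runs to the end.
        System.out.println(endAfterMatches(lazy, hit)); // 14
    }
}
```

      On a longer input with a longer tail after the last literal, the gap between the two end offsets grows, which is consistent with matches() costing more on a lazy hit.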
      

      Attachments

        1. HIVE-14318.1.patch
          0.8 kB
          Gopal Vijayaraghavan

            People

              Assignee: gopalv Gopal Vijayaraghavan
              Reporter: gopalv Gopal Vijayaraghavan
              Votes: 0
              Watchers: 3
