Hive
  1. Hive
  2. HIVE-4160 Vectorized Query Execution in Hive
  3. HIVE-4548

Speed up vectorized LIKE filter for special cases abc%, %abc and %abc%

    Details

    • Type: Sub-task Sub-task
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: vectorization-branch
    • Fix Version/s: vectorization-branch, 0.13.0
    • Component/s: None
    • Labels:
      None

      Description

      Speed up vectorized LIKE filter evaluation for abc%, %abc, and %abc% pattern special cases (here, abc is just a place holder for some fixed string).

      Problem: The current vectorized LIKE implementation always calls the standard LIKE function code in UDFLike.java. But this is pretty expensive. It calls multiple functions and allocates at least one new object per call. Probably 80% of uses of LIKE are for the simple patterns abc%, %abc, and %abc%. These can be implemented much more efficiently.

      Start by speeding up the case for

      Column LIKE "abc%"

      The goal would be to minimize expense in the inner loop. Don't use new() in the inner loop, and write a static function that checks the prefix of the string matches the like pattern as efficiently as possible, operating directly on the byte array holding UTF-8-encoded string data, and avoiding unnecessary additional function calls and if/else logic. Call that in the inner loop.

      If feasible, consider using a template-driven approach, with an instance of the template expanded for each of the three cases. Start doing the abc% (prefix match) by hand, then consider templatizing for the other two cases.

      The code is in the "vectorization" branch of the main hive repo.

      Start by checking in the constructor for FilterStringColLikeStringScalar.java if the pattern is one of the simple special cases. If so, record that, and have the evaluate() method call a special-case function for each case, i.e. the general case, and each of the 3 special cases. All the dynamic decision-making would be done once per vector, not once per element.

      1. HIVE-4548.3.without-benchmark.patch.txt
        19 kB
        Teddy Choi
      2. HIVE-4548.3.with-benchmark.patch.txt
        29 kB
        Teddy Choi
      3. HIVE-4548.2-without-benchmark.patch.txt
        18 kB
        Teddy Choi
      4. HIVE-4548.2-with-benchmark.patch.txt
        29 kB
        Teddy Choi
      5. HIVE-4548.1-without-benchmark.patch.txt
        15 kB
        Teddy Choi
      6. HIVE-4548.1-with-benchmark.patch.txt
        27 kB
        Teddy Choi

        Activity

        Thejas M Nair made changes -
        Fix Version/s 0.13.0 [ 12324986 ]
        Ashutosh Chauhan made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        Ashutosh Chauhan added a comment -

        Committed to branch. Thanks, Teddy!

        Show
        Ashutosh Chauhan added a comment - Committed to branch. Thanks, Teddy!
        Teddy Choi made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hide
        Teddy Choi added a comment -

        Please apply HIVE-4548.3.without-benchmark.patch.txt file.

        Show
        Teddy Choi added a comment - Please apply HIVE-4548 .3.without-benchmark.patch.txt file.
        Teddy Choi made changes -
        Status In Progress [ 3 ] Open [ 1 ]
        Hide
        Eric Hanson added a comment -

        Teddy,

        This look good. Thanks! Please add a comment to make it clear which patch should be applied to the branch. You have two versions of each patch, "with-benchmark" and "without-benchmark". Then mark this JIRA as "patch available."

        Show
        Eric Hanson added a comment - Teddy, This look good. Thanks! Please add a comment to make it clear which patch should be applied to the branch. You have two versions of each patch, "with-benchmark" and "without-benchmark". Then mark this JIRA as "patch available."
        Hide
        Teddy Choi added a comment -

        I uploaded the new patch file, and updated the review request on https://reviews.apache.org/r/11222/ .

        Show
        Teddy Choi added a comment - I uploaded the new patch file, and updated the review request on https://reviews.apache.org/r/11222/ .
        Teddy Choi made changes -
        Attachment HIVE-4548.3.with-benchmark.patch.txt [ 12584944 ]
        Attachment HIVE-4548.3.without-benchmark.patch.txt [ 12584945 ]
        Hide
        Teddy Choi added a comment -

        I agree. I wrote some test for 4 byte UTF-8 character, and it works okay. I'll upload this patch tomorrow.

        public class TestVectorStringExpressions {
          public void testStringLikeMultiByte() throws UnsupportedEncodingException {
            FilterStringColLikeStringScalar expr;
            VectorizedRowBatch batch;
            batch = makeStringBatchMixedCharSize();
            Text text = new Text();
        
            // verify that a multi byte LIKE expression matches a matching string
            text.append("%".getBytes("UTF-8"), 0, 1);
            text.append(multiByte, 0, multiByte.length);
            text.append("%".getBytes("UTF-8"), 0, 1);
            expr = new FilterStringColLikeStringScalar(0, text);
            expr.evaluate(batch);
            Assert.assertEquals(batch.size, 1);
            
            text.clear();
        
            // verify that a multi byte LIKE expression doesn't match a non-matching string
            text.append("%".getBytes("UTF-8"), 0, 1);
            text.append(multiByte, 0, multiByte.length);
            text.append("x".getBytes("UTF-8"), 0, 1);
            batch = makeStringBatchMixedCharSize();
            expr = new FilterStringColLikeStringScalar(0, text);
            expr.evaluate(batch);
            Assert.assertEquals(batch.size, 0);
          }
        }
        
        Show
        Teddy Choi added a comment - I agree. I wrote some test for 4 byte UTF-8 character, and it works okay. I'll upload this patch tomorrow. public class TestVectorStringExpressions { public void testStringLikeMultiByte() throws UnsupportedEncodingException { FilterStringColLikeStringScalar expr; VectorizedRowBatch batch; batch = makeStringBatchMixedCharSize(); Text text = new Text(); // verify that a multi byte LIKE expression matches a matching string text.append( "%" .getBytes( "UTF-8" ), 0, 1); text.append(multiByte, 0, multiByte.length); text.append( "%" .getBytes( "UTF-8" ), 0, 1); expr = new FilterStringColLikeStringScalar(0, text); expr.evaluate(batch); Assert.assertEquals(batch.size, 1); text.clear(); // verify that a multi byte LIKE expression doesn't match a non-matching string text.append( "%" .getBytes( "UTF-8" ), 0, 1); text.append(multiByte, 0, multiByte.length); text.append( "x" .getBytes( "UTF-8" ), 0, 1); batch = makeStringBatchMixedCharSize(); expr = new FilterStringColLikeStringScalar(0, text); expr.evaluate(batch); Assert.assertEquals(batch.size, 0); } }
        Hide
        Eric Hanson added a comment -

        It appears that all the specific characters you are checking for in parseSimplePattern (%, _, ) cannot be the first or last character of a surrogate pair. So I think the code is safe. Please think this through and add some unit tests that process multi-byte UTF-8 characters of 3 bytes or more (which will force encoding as surrogate pairs inside a String).

        See http://en.wikipedia.org/wiki/UTF-16/UCS-2#Code_points_U.2B10000_to_U.2B10FFFF for a discussion of surrogate pairs.

        See http://en.wikipedia.org/wiki/List_of_Unicode_characters for a list of Unicode characters. % is 0x0025, _ is 0x005F, and \ is 0x005C. Surrogate pairs are all have lead surrogates in the range 0xD800..0xDBFF and trail surrogates in the range 0xDC00..0xDFFF.

        Show
        Eric Hanson added a comment - It appears that all the specific characters you are checking for in parseSimplePattern (%, _, ) cannot be the first or last character of a surrogate pair. So I think the code is safe. Please think this through and add some unit tests that process multi-byte UTF-8 characters of 3 bytes or more (which will force encoding as surrogate pairs inside a String). See http://en.wikipedia.org/wiki/UTF-16/UCS-2#Code_points_U.2B10000_to_U.2B10FFFF for a discussion of surrogate pairs. See http://en.wikipedia.org/wiki/List_of_Unicode_characters for a list of Unicode characters. % is 0x0025, _ is 0x005F, and \ is 0x005C. Surrogate pairs are all have lead surrogates in the range 0xD800..0xDBFF and trail surrogates in the range 0xDC00..0xDFFF.
        Hide
        Eric Hanson added a comment -

        Can you confirm if there is a problem or not? E.g. is it possible for a % character to show up as the first or second character of a 2-character sequence in a String that represents a character beyond the standard set of 0x0000 to 0xFFFF.

        If it is indeed a problem, then we should fix it here and open another JIRA to report a bug in the original UDFLike.

        Show
        Eric Hanson added a comment - Can you confirm if there is a problem or not? E.g. is it possible for a % character to show up as the first or second character of a 2-character sequence in a String that represents a character beyond the standard set of 0x0000 to 0xFFFF. If it is indeed a problem, then we should fix it here and open another JIRA to report a bug in the original UDFLike.
        Hide
        Teddy Choi added a comment -

        I read your review and wrote a reply on it. I'll try the Text class.

        However, I'm not sure whether it is okay to fix a same problem on the UDFLike class in this issue.

        Show
        Teddy Choi added a comment - I read your review and wrote a reply on it. I'll try the Text class. However, I'm not sure whether it is okay to fix a same problem on the UDFLike class in this issue.
        Hide
        Eric Hanson added a comment -

        See my comments on Review Board. It looks good but I think there is a functional issue related to multi-byte characters beyond standard 16 bit Unicode.

        Show
        Eric Hanson added a comment - See my comments on Review Board. It looks good but I think there is a functional issue related to multi-byte characters beyond standard 16 bit Unicode.
        Teddy Choi made changes -
        Attachment HIVE-4548.2-with-benchmark.patch.txt [ 12584001 ]
        Attachment HIVE-4548.2-without-benchmark.patch.txt [ 12584002 ]
        Hide
        Teddy Choi added a comment -

        The patch was fixed as Eric Hanson reviewed. It is written in correct code style. It has detailed comments and unit tests.

        Moreover, the capacity of a new byte buffer is now double of the requested capacity. It's for further performance optimization.

        By the way, I found that TestConstantVectorExpression needs to import VectorizedRowGroupGenUtil to be compiled.

        Show
        Teddy Choi added a comment - The patch was fixed as Eric Hanson reviewed. It is written in correct code style. It has detailed comments and unit tests. Moreover, the capacity of a new byte buffer is now double of the requested capacity. It's for further performance optimization. By the way, I found that TestConstantVectorExpression needs to import VectorizedRowGroupGenUtil to be compiled.
        Hide
        Eric Hanson added a comment -

        Posted review. Looks good overall! Needs a few small changes and additional unit test(s).

        Show
        Eric Hanson added a comment - Posted review. Looks good overall! Needs a few small changes and additional unit test(s).
        Hide
        Teddy Choi added a comment -
        Show
        Teddy Choi added a comment - Review request on https://reviews.apache.org/r/11222/
        Teddy Choi made changes -
        Attachment HIVE-4548.1-with-benchmark.patch.txt [ 12583660 ]
        Attachment HIVE-4548.1-without-benchmark.patch.txt [ 12583661 ]
        Hide
        Teddy Choi added a comment -

        I edited FilterStringColLikeStringScala.java as Eric Hanson wrote.

        For none-complex patterns, it calls a static method that doesn't call others and uses its given byte arrays only. For complex patterns, it reuses a ByteBuffer and a CharBuffer for decoding UTF-8 to avoid object constructions.

        There is 30%~170% performance improvement for all cases. Its benchmark result is attached.

        test:
             [echo] Project: ql
            [junit] WARNING: multiple versions of ant detected in path for junit 
            [junit]          jar:file:/usr/share/ant/lib/ant.jar!/org/apache/tools/ant/Project.class
            [junit]      and jar:file:/Users/pudidic/IdeaProjects/hive/build/ivy/lib/hadoop0.20S.shim/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
            [junit] Running org.apache.hadoop.hive.ql.exec.vector.expressions.TestFilterStringColLikeStringScalar
            [junit] ----
            [junit] mix%
            [junit] new 1077ms.
            [junit] old 2908ms.
            [junit] 170.00928%	faster
            [junit] ----
            [junit] %Up
            [junit] new 1008ms.
            [junit] old 2244ms.
            [junit] 122.61906%	faster
            [junit] ----
            [junit] %dU%
            [junit] new 1792ms.
            [junit] old 3350ms.
            [junit] 86.94197%	faster
            [junit] ----
            [junit] m%dU%
            [junit] new 17290ms.
            [junit] old 24224ms.
            [junit] 40.104103%	faster
            [junit] ----
            [junit] mixedUp
            [junit] new 1347ms.
            [junit] old 2907ms.
            [junit] 115.81292%	faster
            [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 58.683 sec
        
        BUILD SUCCESSFUL
        Total time: 1 minute 57 seconds
        

        It still can be more efficient by using a template-driven approach. I'll apply it soon.

        Show
        Teddy Choi added a comment - I edited FilterStringColLikeStringScala.java as Eric Hanson wrote. For none-complex patterns, it calls a static method that doesn't call others and uses its given byte arrays only. For complex patterns, it reuses a ByteBuffer and a CharBuffer for decoding UTF-8 to avoid object constructions. There is 30%~170% performance improvement for all cases. Its benchmark result is attached. test: [echo] Project: ql [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/usr/share/ant/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:file:/Users/pudidic/IdeaProjects/hive/build/ivy/lib/hadoop0.20S.shim/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.hadoop.hive.ql.exec.vector.expressions.TestFilterStringColLikeStringScalar [junit] ---- [junit] mix% [junit] new 1077ms. [junit] old 2908ms. [junit] 170.00928% faster [junit] ---- [junit] %Up [junit] new 1008ms. [junit] old 2244ms. [junit] 122.61906% faster [junit] ---- [junit] %dU% [junit] new 1792ms. [junit] old 3350ms. [junit] 86.94197% faster [junit] ---- [junit] m%dU% [junit] new 17290ms. [junit] old 24224ms. [junit] 40.104103% faster [junit] ---- [junit] mixedUp [junit] new 1347ms. [junit] old 2907ms. [junit] 115.81292% faster [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 58.683 sec BUILD SUCCESSFUL Total time: 1 minute 57 seconds It still can be more efficient by using a template-driven approach. I'll apply it soon.
        Teddy Choi made changes -
        Field Original Value New Value
        Status Open [ 1 ] In Progress [ 3 ]
        Eric Hanson created issue -

          People

          • Assignee:
            Teddy Choi
            Reporter:
            Eric Hanson
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development