Details

      Description

      Query: select <column-list> from <table> where colA like '%a%' or colA like '%xyz%';

      Improvement Opportunities

      1. Avoid isAscii computation (full access of the input string) since we're dealing with the same column twice
      2. Optimize the "contains" for-loop

      Implementation Details
      1)

      • Added a new integer variable "asciiMode" to the VarCharHolder class
      • The default value is -1 which indicates this info is not known
      • Otherwise this value will be set to either 1 or 0 based on the string being in ASCII mode or Unicode
      • The execution plan already shares the same VarCharHolder instance for all evaluations of the same column value
      • The asciiMode will be correctly set during the first LIKE evaluation and will be reused across other LIKE evaluations

      2)

      • The "Contains" LIKE operation is quite expensive as the code needs to access the input string to perform character based comparisons
      • Created 4 versions of the same for-loop to a) make the loop simpler to optimize (Vectorization) and b) minimize comparisons

      Benchmarks

      • Lineitem table 100GB
      • Query: select l_returnflag, count from dfs.`<source>` where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflag
      • Before changes: 33sec
      • After changes : 27sec

        Attachments

          Activity

            People

            • Assignee:
              sachouche salim achouche
              Reporter:
              sachouche salim achouche
              Reviewer:
              Paul Rogers
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: