  Apache Drill / DRILL-5879

Optimize "Like" operator


Details

    Description

      Query: select <column-list> from <table> where colA like '%a%' or colA like '%xyz%';

      Improvement Opportunities

      1. Avoid recomputing isAscii (which requires a full scan of the input string) when the same column is evaluated by more than one LIKE predicate
      2. Optimize the "contains" for-loop

      Implementation Details
      1)

      • Added a new integer variable "asciiMode" to the VarCharHolder class
      • The default value is -1, which indicates that this information is not yet known
      • Otherwise the value is set to either 1 or 0, depending on whether the string is in ASCII mode or Unicode
      • The execution plan already shares the same VarCharHolder instance for all evaluations of the same column value
      • The asciiMode will be correctly set during the first LIKE evaluation and will be reused across other LIKE evaluations
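
The caching idea above can be sketched as follows. This is a minimal illustration, not the code from the patch: the `Holder` class is a simplified, hypothetical stand-in for Drill's VarCharHolder, the `isAscii` and `holderOf` helpers and the `scans` counter are invented for the example, and the mapping 1 = ASCII, 0 = not ASCII is an assumption based on the description.

```java
import java.nio.charset.StandardCharsets;

final class AsciiModeSketch {
    /** Hypothetical, simplified stand-in for VarCharHolder (not the real class). */
    static final class Holder {
        byte[] buffer;       // UTF-8 bytes backing the value
        int start;           // inclusive start offset
        int end;             // exclusive end offset
        int asciiMode = -1;  // -1: unknown; assumed 1: ASCII, 0: not ASCII
    }

    /** Counts full scans of the input; for demonstration only. */
    static int scans = 0;

    /** Builds a holder over the UTF-8 bytes of a string (helper for the example). */
    static Holder holderOf(String s) {
        Holder h = new Holder();
        h.buffer = s.getBytes(StandardCharsets.UTF_8);
        h.start = 0;
        h.end = h.buffer.length;
        return h;
    }

    /** Scans the value only when asciiMode is still unknown, then reuses the cached flag. */
    static boolean isAscii(Holder h) {
        if (h.asciiMode == -1) {
            scans++;
            int mode = 1;
            for (int i = h.start; i < h.end; i++) {
                if (h.buffer[i] < 0) { // high bit set: part of a multi-byte UTF-8 sequence
                    mode = 0;
                    break;
                }
            }
            h.asciiMode = mode;
        }
        return h.asciiMode == 1;
    }

    public static void main(String[] args) {
        Holder h = holderOf("carefully final deposits");
        // Two LIKE predicates on the same column value share one holder,
        // so only the first call pays for the scan.
        System.out.println(isAscii(h) + " " + isAscii(h) + " scans=" + scans);
    }
}
```

In a sketch like this, the flag would presumably have to be reset to -1 whenever the holder is loaded with a new value; the description does not say how the patch handles that step.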

      2)

      • The "Contains" LIKE operation is quite expensive because the code must read the input string to perform character-based comparisons
      • Created 4 versions of the same for-loop to a) make the loop simpler to optimize (Vectorization) and b) minimize comparisons
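
The description does not say how the four loop versions differ. One plausible split, assumed here purely for illustration, is by pattern length: dedicated loops for patterns of 1, 2, and 3 bytes, plus a general loop for longer patterns. All class and method names below are hypothetical, and the sketch compares raw UTF-8 bytes rather than decoded characters.

```java
import java.nio.charset.StandardCharsets;

final class ContainsSketch {
    /** Dispatches to a loop specialized by pattern length (assumed split, not from the patch). */
    static boolean contains(byte[] in, int start, int end, byte[] pat) {
        final int n = pat.length;
        if (n == 0) {
            return true; // empty pattern matches everything
        }
        final int last = end - n; // last position where a match can start
        switch (n) {
            case 1:  return contains1(in, start, last, pat[0]);
            case 2:  return contains2(in, start, last, pat[0], pat[1]);
            case 3:  return contains3(in, start, last, pat[0], pat[1], pat[2]);
            default: return containsN(in, start, last, pat);
        }
    }

    // Each specialized loop has a fixed, branch-light body and no inner loop.
    static boolean contains1(byte[] in, int start, int last, byte p0) {
        for (int i = start; i <= last; i++) {
            if (in[i] == p0) return true;
        }
        return false;
    }

    static boolean contains2(byte[] in, int start, int last, byte p0, byte p1) {
        for (int i = start; i <= last; i++) {
            if (in[i] == p0 && in[i + 1] == p1) return true;
        }
        return false;
    }

    static boolean contains3(byte[] in, int start, int last, byte p0, byte p1, byte p2) {
        for (int i = start; i <= last; i++) {
            if (in[i] == p0 && in[i + 1] == p1 && in[i + 2] == p2) return true;
        }
        return false;
    }

    // General case: check the first byte before entering the inner comparison loop.
    static boolean containsN(byte[] in, int start, int last, byte[] pat) {
        final byte p0 = pat[0];
        for (int i = start; i <= last; i++) {
            if (in[i] != p0) continue;
            int j = 1;
            while (j < pat.length && in[i + j] == pat[j]) j++;
            if (j == pat.length) return true;
        }
        return false;
    }

    /** Convenience wrapper over strings, for the example only. */
    static boolean contains(String input, String pattern) {
        byte[] in = input.getBytes(StandardCharsets.UTF_8);
        return contains(in, 0, in.length, pattern.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(contains("furiously the deposits", "the"));
        System.out.println(contains("furiously the deposits", "xyz"));
    }
}
```

The short-pattern loops avoid the inner comparison loop entirely, which is one way to get both goals named above: a simpler loop body and fewer comparisons per input position.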

      Benchmarks

      • Lineitem table, 100 GB
      • Query: select l_returnflag, count(*) from dfs.`<source>` where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflag
      • Before changes: 33 sec
      • After changes: 27 sec


          People

            Salim Achouche (sachouche)
            Paul Rogers