Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-26205

Optimize InSet expression for bytes, shorts, ints, dates

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.0.0
    • 3.0.0
    • SQL
    • None

    Description

      In expressions are compiled into a sequence of if-else statements, which results in O(n) time complexity. InSet is an optimized version of In, which is supposed to improve the performance if the number of elements is big enough. However, InSet actually degrades the performance in many cases due to various reasons (benchmarks were created in SPARK-26203 and solutions to the boxing problem are discussed in SPARK-26204).

      The main idea of this JIRA is to use Java switch statements to significantly improve the performance of InSet expressions for bytes, shorts, ints, dates. All switch statements are compiled into tableswitch and lookupswitch bytecode instructions. We will have O(1) time complexity if our case values are compact and tableswitch can be used. Otherwise, lookupswitch will give us O(log n). Our local benchmarks show that this logic is more than two times faster even on 500+ elements than using primitive collections in InSet expressions. As Spark is using Scala HashSet right now, the performance gain will be is even bigger.

      See here and here for more information.

      Attachments

        Issue Links

          Activity

            People

              aokolnychyi Anton Okolnychyi
              aokolnychyi Anton Okolnychyi
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: