Description
In expressions are compiled into a sequence of if-else statements, which results in O(n) time complexity. InSet is an optimized version of In, which is supposed to improve the performance if the number of elements is big enough. However, InSet actually degrades the performance in many cases due to various reasons (benchmarks were created in SPARK-26203 and solutions to the boxing problem are discussed in SPARK-26204).
The main idea of this JIRA is to use Java switch statements to significantly improve the performance of InSet expressions for bytes, shorts, ints, dates. All switch statements are compiled into tableswitch and lookupswitch bytecode instructions. We will have O(1) time complexity if our case values are compact and tableswitch can be used. Otherwise, lookupswitch will give us O(log n). Our local benchmarks show that this logic is more than two times faster even on 500+ elements than using primitive collections in InSet expressions. As Spark is using Scala HashSet right now, the performance gain will be is even bigger.
Attachments
Issue Links
- causes
-
SPARK-29100 Codegen with switch in InSet expression causes compilation error
- Resolved
- links to