Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
3.1.0
-
None
-
None
Description
The InSet expression was introduced in SPARK-3711 to avoid O(n) time complexity in the In expression. As InSet relies on Scala immutable.Set, it introduces expensive autoboxing. As a consequence, the performance of InSet might be significantly slower than In even on 100+ values.
We need to find an approach how to optimize InSet expressions and avoid the cost of autoboxing.
There are a few approaches that we can use:
- Collections for primitive values (e.g., FastUtil, HPPC)
- Type specialization in Scala (e.g., OpenHashSet in Spark)
According to my local benchmarks, OpenHashSet, which is already available in Spark and uses type specialization, can significantly reduce the memory footprint. However, it slows down the computation even compared to the built-in Scala sets. On the other hand, FastUtil and HPPC did work and gave a substantial improvement in the performance. So, it makes sense to evaluate primitive collections.
See the attached screenshot of what I experienced while testing.