[SPARK-26204] Optimize InSet expression - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: SQL
Labels: None

Description

The InSet expression was introduced in SPARK-3711 to avoid the O(n) time complexity of the In expression. Because InSet relies on Scala's immutable.Set, it incurs expensive autoboxing of primitive values. As a consequence, InSet can be significantly slower than In, even on 100+ values.

We need to find a way to optimize InSet expressions and avoid the cost of autoboxing.
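
To make the autoboxing cost concrete, here is a minimal Java sketch of the two lookup styles. It is not Spark's actual code: `BoxingSketch` and its methods are hypothetical names, and `java.util.HashSet` stands in for Scala's immutable.Set. The linear scan compares primitives directly, while the set lookup has to box every probed value.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; names are not taken from Spark.
final class BoxingSketch {

    // Analogue of In: O(n) scan, but every comparison stays on primitive longs.
    static boolean linearContains(long[] values, long probe) {
        for (long v : values) {
            if (v == probe) {
                return true;
            }
        }
        return false;
    }

    // Builds a generic set; each element is boxed to java.lang.Long on insert.
    static Set<Object> toBoxedSet(long[] values) {
        Set<Object> set = new HashSet<>();
        for (long v : values) {
            set.add(v); // autoboxing happens here
        }
        return set;
    }

    // Analogue of InSet: expected O(1) lookup, but the probe is boxed
    // on every call before hashCode/equals can run.
    static boolean boxedContains(Set<Object> values, long probe) {
        return values.contains(probe); // autoboxing happens here
    }
}
```

Both methods return the same answers; the difference is only in the per-row work. With a generic set, the boxing sits on the hot path of every evaluated row, which is the overhead this issue aims to remove.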

There are a few approaches that we can use:

- Collections for primitive values (e.g., FastUtil, HPPC)
- Type specialization in Scala (e.g., OpenHashSet in Spark)

According to my local benchmarks, OpenHashSet, which is already available in Spark and uses type specialization, significantly reduces the memory footprint. However, it slows down the computation, even compared to the built-in Scala sets. FastUtil and HPPC, on the other hand, worked well and gave a substantial performance improvement. So, it makes sense to evaluate primitive collections.
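
As a rough idea of what a primitive collection does differently, the sketch below implements a tiny open-addressing set over `long` values with linear probing. It is an assumption-laden illustration, not the FastUtil or HPPC implementation: the class name `LongHashSetSketch`, the hash mixing constant, and the fixed load factor are all choices made for this example.

```java
// Hypothetical minimal primitive set; not the FastUtil or HPPC implementation.
final class LongHashSetSketch {
    private final long[] slots;   // stored values, kept as primitives
    private final boolean[] used; // marks which slots are occupied
    private final int mask;       // capacity - 1, capacity is a power of two

    LongHashSetSketch(long[] values) {
        // Keep the load factor at or below 0.5 so probing always finds an empty slot.
        int capacity = 4;
        while (capacity < values.length * 2) {
            capacity <<= 1;
        }
        slots = new long[capacity];
        used = new boolean[capacity];
        mask = capacity - 1;
        for (long v : values) {
            add(v);
        }
    }

    // Spreads the bits of the key; the multiplier is an arbitrary odd constant.
    private static int mix(long v) {
        long h = v * 0x9E3779B97F4A7C15L;
        return (int) (h ^ (h >>> 32));
    }

    private void add(long v) {
        int i = mix(v) & mask;
        while (used[i]) {
            if (slots[i] == v) {
                return; // already present
            }
            i = (i + 1) & mask; // linear probing
        }
        used[i] = true;
        slots[i] = v;
    }

    // Lookup never allocates and never boxes the probe.
    boolean contains(long v) {
        int i = mix(v) & mask;
        while (used[i]) {
            if (slots[i] == v) {
                return true;
            }
            i = (i + 1) & mask;
        }
        return false;
    }
}
```

A set built once from the literal list, for example `new LongHashSetSketch(new long[]{1L, 5L, 42L})`, can then be probed per row with `contains(value)` without creating any `java.lang.Long` objects. A production library would add resizing, removal, and tuned hashing on top of this.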

See the attached screenshot (heap size.png) for what I observed while testing.

Attachments

heap size.png, 28/Nov/18 16:33, 193 kB, Anton Okolnychyi

People

Assignee: Unassigned

Reporter: Anton Okolnychyi

Votes: 0

Watchers: 5

Dates

Created: 28/Nov/18 16:31

Updated: 16/Mar/20 22:52