Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-26204

Optimize InSet expression

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.1.0
    • None
    • SQL
    • None

    Description

      The InSet expression was introduced in SPARK-3711 to avoid O(n) time complexity in the In expression. As InSet relies on Scala immutable.Set, it introduces expensive autoboxing. As a consequence, the performance of InSet might be significantly slower than In even on 100+ values.

      We need to find an approach how to optimize InSet expressions and avoid the cost of autoboxing.

       There are a few approaches that we can use:

      • Collections for primitive values (e.g., FastUtil,  HPPC)
      • Type specialization in Scala (e.g., OpenHashSet in Spark)

      According to my local benchmarks, OpenHashSet, which is already available in Spark and uses type specialization, can significantly reduce the memory footprint. However, it slows down the computation even compared to the built-in Scala sets. On the other hand, FastUtil and HPPC did work and gave a substantial improvement in the performance. So, it makes sense to evaluate primitive collections.

      See the attached screenshot of what I experienced while testing.

      Attachments

        1. heap size.png
          193 kB
          Anton Okolnychyi

        Activity

          People

            Unassigned Unassigned
            aokolnychyi Anton Okolnychyi
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated: