Spark / SPARK-48030

InternalRowComparableWrapper should cache rowOrdering to improve performance

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.5.1, 3.4.3
    • Fix Version/s: 4.0.0
    • Component/s: SQL

    Description

      InternalRowComparableWrapper recreates the row ordering for each output partition when storage-partitioned join (SPJ) is enabled. The row ordering is generated via codegen, which is quite expensive, and the number of output partitions can be very large for production tables, for example hundreds of thousands of partitions. We encountered this issue when applying SPJ to multiple large Iceberg tables: the planning phase took tens of minutes to complete.

      The attached screenshot shows the related stack trace.

      A simple fix would be to cache the rowOrdering for InternalRowComparableWrapper, since the data type of the InternalRow is immutable.
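
The caching idea can be sketched without Spark as follows. This is only an illustration under stated assumptions, not the actual patch: `OrderingCache`, `orderingFor`, `buildOrdering`, the `Seq[String]` schema key, and the `Seq[Int]` row type are all hypothetical stand-ins for the partition data types, the codegen'd ordering, and InternalRow.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger
import java.util.function.{Function => JFunction}

// Hypothetical stand-in for the proposed cache; none of these names come from Spark.
object OrderingCache {
  // Stand-in for InternalRow: a row is just a sequence of ints here.
  type Row = Seq[Int]

  // Counts how often the (simulated) expensive build actually runs.
  val builds = new AtomicInteger(0)

  // Keyed by the schema (a Seq has structural equals/hashCode), which is
  // safe to use as a key because the data types never change.
  private val cache = new ConcurrentHashMap[Seq[String], Ordering[Row]]()

  // Simulates the expensive step (codegen in Spark). The comparison is
  // field by field and assumes both rows have the same length.
  private def buildOrdering(dataTypes: Seq[String]): Ordering[Row] = {
    builds.incrementAndGet()
    new Ordering[Row] {
      override def compare(a: Row, b: Row): Int =
        a.zip(b).iterator
          .map { case (x, y) => Integer.compare(x, y) }
          .find(_ != 0)
          .getOrElse(0)
    }
  }

  // Returns the cached ordering, building it at most once per distinct schema.
  def orderingFor(dataTypes: Seq[String]): Ordering[Row] =
    cache.computeIfAbsent(dataTypes, new JFunction[Seq[String], Ordering[Row]] {
      override def apply(key: Seq[String]): Ordering[Row] = buildOrdering(key)
    })
}

object Demo {
  def main(args: Array[String]): Unit = {
    val schema = Seq("int", "int")
    // Many partition keys share one schema, so every lookup after the
    // first is a map hit instead of a rebuild.
    val keys = (1 to 100000).map(i => Seq(i % 10, i))
    keys.foreach(_ => OrderingCache.orderingFor(schema))
    val smallest = keys.min(OrderingCache.orderingFor(schema))
    println(s"smallest key: $smallest, orderings built: ${OrderingCache.builds.get}")
  }
}
```

With this shape, the cost of building the ordering scales with the number of distinct schemas rather than with the number of output partitions, which is the effect the proposed fix is after.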

      Attachments

        1. screenshot-1.png
          403 kB

            People

              Assignee: advancedxy
              Reporter: advancedxy