Spark / SPARK-25947

Reduce memory usage in ShuffleExchangeExec by selecting only the sort columns


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.2
    • Fix Version/s: 3.0.0
    • Component/s: SQL
    • Labels: None

      Description

      When sorting rows, ShuffleExchangeExec uses the entire row, rather than only the columns referenced in the SortOrder, to create the RangePartitioner. As a result, the RangePartitioner samples entire rows to compute its rangeBounds, which can cause out-of-memory (OOM) errors on the driver when rows contain large fields.

      Proposed fix: create a projection so that only the columns involved in the SortOrder are passed to the RangePartitioner.
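
      The idea can be illustrated outside Spark with a minimal Python sketch. This is not the ShuffleExchangeExec or RangePartitioner code; the helper names (`project`, `reservoir_sample`, `range_bounds`, `partition_for`) and the tuple-based row model are assumptions made for illustration only. It shows that projecting each row down to its sort-key columns before sampling yields the same kind of range bounds while the sample held in memory no longer carries the large fields.

```python
import bisect
import random
from typing import Any, Iterable, List, Sequence, Tuple

# Hypothetical row model: a row is a plain tuple of column values.
Row = Tuple[Any, ...]


def project(row: Row, key_indices: Sequence[int]) -> Row:
    """Keep only the sort-key columns of a row (illustrative helper)."""
    return tuple(row[i] for i in key_indices)


def reservoir_sample(items: Iterable[Row], k: int, rng: random.Random) -> List[Row]:
    """Uniformly sample up to k items from a stream in a single pass."""
    sample: List[Row] = []
    for n, item in enumerate(items):
        if n < k:
            sample.append(item)
        else:
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def range_bounds(sampled_keys: Sequence[Row], num_partitions: int) -> List[Row]:
    """Pick num_partitions - 1 split points from the sorted sample."""
    ordered = sorted(sampled_keys)
    if not ordered:
        return []
    return [
        ordered[min(len(ordered) - 1, p * len(ordered) // num_partitions)]
        for p in range(1, num_partitions)
    ]


def partition_for(key: Row, bounds: Sequence[Row]) -> int:
    """Partition index = number of bounds strictly less than the key."""
    return bisect.bisect_left(bounds, key)


if __name__ == "__main__":
    # Each row has a small sort key (column 0) and a large payload (column 1).
    rows = [((i * 37) % 1000, "x" * 10_000) for i in range(1000)]
    # Project first, then sample: the sample holds only 1-column key tuples.
    keys = (project(r, [0]) for r in rows)
    bounds = range_bounds(reservoir_sample(keys, 50, random.Random(7)), 4)
    print(bounds, partition_for((500,), bounds))
```

      In this sketch the sampler selects items by position only, so sampling projected keys and sampling full rows with the same seed choose the same records. The difference is that the projected sample never holds the 10,000-character payload column.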

            People

            • Assignee: mu5358271 Shuheng Dai
            • Reporter: mu5358271 Shuheng Dai
