Spark / SPARK-37649

Switch the default index type to distributed-sequence in pandas API on Spark


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.3.0
    • Fix Version/s: 3.3.0
    • Component/s: PySpark

    Description

      pandas API on Spark currently sets compute.default_index_type to sequence, which relies on sending all data to a single executor and can easily cause out-of-memory (OOM) errors.

      We should switch to the distributed-sequence type, which keeps the data truly distributed.

      With this change, we leverage https://issues.apache.org/jira/browse/SPARK-36559 and https://issues.apache.org/jira/browse/SPARK-36338 by default, and end users will benefit from a significant performance improvement.
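
      The difference between the two index types can be illustrated with a small, pure-Python sketch. It is a conceptual model only, not Spark's actual implementation: partitions are modeled as plain Python lists, and the function names sequence_index and distributed_sequence_index are hypothetical. The sequence-style function gathers every row in one place before numbering them, while the distributed-sequence-style function centralizes only the per-partition row counts and lets each partition number its own rows from a computed offset.

```python
from itertools import accumulate
from typing import List, Sequence


def sequence_index(partitions: Sequence[Sequence[str]]) -> List[int]:
    """'sequence'-style: collect all rows in one place, then number them.

    The full dataset must fit on a single node, which is the OOM risk
    described in the issue.
    """
    all_rows = [row for part in partitions for row in part]
    return list(range(len(all_rows)))


def distributed_sequence_index(
    partitions: Sequence[Sequence[str]],
) -> List[List[int]]:
    """'distributed-sequence'-style: only per-partition counts are centralized.

    Each partition receives a starting offset and numbers its own rows,
    so the rows themselves never leave their partition.
    """
    counts = [len(part) for part in partitions]
    # Offset of a partition = total number of rows in all earlier partitions.
    offsets = [0] + list(accumulate(counts))[:-1]
    return [list(range(off, off + n)) for off, n in zip(offsets, counts)]


if __name__ == "__main__":
    parts = [["a", "b"], ["c"], ["d", "e", "f"]]
    print(sequence_index(parts))
    print(distributed_sequence_index(parts))
```

      Both functions produce the same global numbering; they differ only in how much data has to be moved to a single place. In pandas API on Spark itself, the index type is selected through the option named in the description, for example ps.set_option("compute.default_index_type", "distributed-sequence") after import pyspark.pandas as ps.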


          People

            Assignee: Hyukjin Kwon (gurwls223)
            Reporter: Hyukjin Kwon (gurwls223)