[SPARK-37649] Switch default index to distributed-sequence by default in pandas API on Spark - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 3.3.0
Fix Version/s: 3.3.0
Component/s: PySpark
Labels:
- release-notes

Description

pandas API on Spark currently sets compute.default_index_type to sequence, which builds the index by sending all data to a single executor and can easily cause an OOM.

We should switch the default to the distributed-sequence type, which computes the index while keeping the data distributed.
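To illustrate why a sequential index does not require collecting all rows in one place, here is a minimal pure-Python sketch of a count-then-offset scheme. It is not Spark's implementation: the function name distributed_sequence and the list-of-lists stand-in for partitions are hypothetical, chosen only for illustration.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def distributed_sequence(
    partitions: Sequence[Sequence[str]],
) -> List[List[Tuple[int, str]]]:
    """Assign a global 0..n-1 index without moving rows between partitions."""
    # Pass 1: each partition reports only its row count (small metadata).
    counts = [len(part) for part in partitions]
    # Exclusive prefix sum: the starting index of each partition.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Pass 2: each partition numbers its own rows locally from its offset.
    return [
        [(offset + i, row) for i, row in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


# Hypothetical data: four "partitions", one of them empty.
parts = [["a", "b"], ["c"], [], ["d", "e", "f"]]
indexed = distributed_sequence(parts)
print(indexed)
```

Only the per-partition counts are gathered centrally; the rows themselves stay where they are, which is the property the sequence type lacks.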

With this change, the improvements from https://issues.apache.org/jira/browse/SPARK-36559 and https://issues.apache.org/jira/browse/SPARK-36338 take effect by default, and end users get a significant performance improvement without changing any configuration.
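For reference, the option can also be inspected or set explicitly. The following is a small configuration sketch, assuming a working PySpark installation with pandas API on Spark available. The ps.range(5) DataFrame is only a placeholder workload.

```python
import pyspark.pandas as ps

# Inspect the session's current default index type.
print(ps.get_option("compute.default_index_type"))

# Temporarily use the single-node "sequence" index, e.g. for small data.
with ps.option_context("compute.default_index_type", "sequence"):
    psdf = ps.range(5)  # placeholder DataFrame; a default index is attached
    print(psdf.index.to_numpy())

# Set the distributed index explicitly, e.g. on versions where
# "sequence" is still the default.
ps.set_option("compute.default_index_type", "distributed-sequence")
```

Using option_context limits the override to one block, so the rest of the session keeps the distributed default.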

Issue Links

links to

[Github] Pull Request #34902 (HyukjinKwon)

People

Assignee: Hyukjin Kwon

Reporter: Hyukjin Kwon

Votes: 0

Watchers: 1

Dates

Created: 15/Dec/21 00:30

Updated: 12/Dec/22 18:10

Resolved: 15/Dec/21 02:23