Details
Type: Improvement
Status: Resolved
Priority: Critical
Resolution: Fixed
Version: 3.3.0
Description
pandas API on Spark currently sets compute.default_index_type to sequence, which relies on sending all data to a single executor and can easily cause OOM.
We should switch to the distributed-sequence type, which actually distributes the data.
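To illustrate the difference between the two index types, here is a conceptual sketch in plain Python. It is not Spark's implementation; the function names and the list-of-lists model of partitions are assumptions made for illustration only. It shows one common way to assign a global sequence without collecting all rows in one place: count rows per partition, derive a starting offset for each, then number rows locally.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple

Row = str


def sequence_index(partitions: Sequence[Sequence[Row]]) -> List[Tuple[int, Row]]:
    """Hypothetical model of a 'sequence' index: every row is gathered
    in one place before being numbered (the single-node bottleneck)."""
    all_rows = [row for part in partitions for row in part]
    return list(enumerate(all_rows))


def distributed_sequence_index(
    partitions: Sequence[Sequence[Row]],
) -> List[List[Tuple[int, Row]]]:
    """Hypothetical model of a distributed sequence index: only the
    per-partition row counts are shared; rows stay where they are."""
    # Step 1: each partition reports its row count (a single integer).
    counts = [len(part) for part in partitions]
    # Step 2: an exclusive prefix sum gives each partition's starting offset.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Step 3: each partition numbers its own rows from its offset.
    return [
        [(offset + i, row) for i, row in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


partitions = [["a", "b"], ["c"], [], ["d", "e"]]
indexed = distributed_sequence_index(partitions)
flat = [pair for part in indexed for pair in part]
```

In this model both functions produce the same numbering, but the second never moves row data between partitions.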
With this change, we can leverage https://issues.apache.org/jira/browse/SPARK-36559 and https://issues.apache.org/jira/browse/SPARK-36338 by default, and end users will benefit from substantial performance improvements.