Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
Description
Join judgements requires large variables (dedupParams) that Spark will have to deserialize for every task.
For very large partition counts this adds a significant overhead. For jobs with sedona.join.numpartition=20000 (34k actual tasks) task deserialization time takes on average 200 ms.
By broadcasting dedupParams we've been able to reduce task deserialization time to 11 ms. Total job execution time is reduces by 20%.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables
Attachments
Issue Links
- links to