Details
-
Bug
-
Status: Closed
-
Blocker
-
Resolution: Fixed
-
1.1.0
-
None
Description
The following output is provided by shenhong in SPARK-4380.
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4792 B to disk (292769 spills so far) 14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4760 B to disk (292770 spills so far) 14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4520 B to disk (292771 spills so far) 14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4560 B to disk (292772 spills so far) 14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4792 B to disk (292773 spills so far) 14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4784 B to disk (292774 spills so far)
Spilling many small files has two implications. First, it can cause "too many open files" exceptions, as we observed in SPARK-3633. Second, it causes degradation in performance. We have seen slight performance regressions from 1.0.2 to 1.1.0, and this is likely the cause.
Note that this is spun-off from SPARK-4452, the fixing of which involves a bigger change in the way we track shuffle memory. This issue is smaller in scope in that it only makes sure we don't constantly spill, regardless of the policy we use for tracking shuffle memory.