Spark / SPARK-4480

Avoid many small spills in external data structures

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.1.1, 1.2.0
    • Component/s: Spark Core
    • Labels: None

    Description

      The following output was provided by shenhong in SPARK-4380.

      14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4792 B to disk (292769 spills so far)
      14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4760 B to disk (292770 spills so far)
      14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4520 B to disk (292771 spills so far)
      14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4560 B to disk (292772 spills so far)
      14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4792 B to disk (292773 spills so far)
      14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4784 B to disk (292774 spills so far)
      

      Spilling many small files has two consequences. First, it can cause "too many open files" exceptions, as we observed in SPARK-3633. Second, it degrades performance. We have seen slight performance regressions from 1.0.2 to 1.1.0, and this is likely the cause.
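
      To check whether a job shows this pattern, the spill lines in an executor log can be summarized. The following is a minimal sketch that assumes the log format shown in the sample output above and only handles batch sizes reported in bytes ("B"). The name `summarize_spills` and the regular expression are illustrative and are not part of Spark.

```python
import re
from statistics import mean

# Pattern inferred from the sample log lines above (an assumption, not a
# Spark-defined format). Only sizes printed in plain bytes are matched.
SPILL_RE = re.compile(
    r"Thread (?P<thread>\d+) spilling in-memory batch of "
    r"(?P<size>\d+) B to disk \((?P<count>\d+) spills? so far\)"
)


def summarize_spills(lines):
    """Return (matching lines, mean batch size in bytes, highest spill count)."""
    sizes, counts = [], []
    for line in lines:
        m = SPILL_RE.search(line)
        if m:
            sizes.append(int(m.group("size")))
            counts.append(int(m.group("count")))
    if not sizes:
        return 0, 0.0, 0
    return len(sizes), mean(sizes), max(counts)
```

      A mean batch size of a few kilobytes together with a spill count in the hundreds of thousands, as in the output above, is the signature of the problem described here.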

      Note that this is spun off from SPARK-4452, whose fix involves a bigger change in the way we track shuffle memory. This issue is smaller in scope: it only makes sure we don't constantly spill, regardless of the policy we use for tracking shuffle memory.
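
      One way to avoid constant spilling independently of the memory-tracking policy is to put a floor under the size of each spilled batch. The sketch below is a toy model of that idea, not the actual Spark patch: the class `SpillableBuffer`, the parameter `min_spill_bytes`, and the `request_memory` callback are all hypothetical names introduced for illustration.

```python
class SpillableBuffer:
    """Toy in-memory buffer that never spills a batch smaller than a floor.

    Hypothetical model for illustration only; it does not mirror Spark's
    ExternalSorter implementation.
    """

    def __init__(self, min_spill_bytes, request_memory):
        self.min_spill_bytes = min_spill_bytes  # hypothetical floor, in bytes
        self.request_memory = request_memory    # callback: bytes wanted -> bytes granted
        self.granted = 0                        # bytes granted by the memory tracker
        self.used = 0                           # bytes currently buffered
        self.spill_sizes = []                   # size of each spilled batch

    def insert(self, nbytes):
        self.used += nbytes
        # Below the floor (or within the current grant) we never spill.
        if self.used <= max(self.granted, self.min_spill_bytes):
            return
        # Ask the tracker, whatever its policy, to cover the shortfall.
        self.granted += self.request_memory(self.used - self.granted)
        if self.used > self.granted:
            self.spill()

    def spill(self):
        # Record the batch size, then release the buffer and the grant.
        self.spill_sizes.append(self.used)
        self.used = 0
        self.granted = 0
```

      With a tracker that grants nothing (`lambda n: 0`) and a floor of 0, every insert triggers its own spill, which reproduces the pattern in the log above. With a non-zero floor, each spilled batch is larger than the floor regardless of what the tracker returns.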

          People

            Assignee: Andrew Or (andrewor14)
            Reporter: Andrew Or (andrewor14)