Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-20250

Improper OOM error when a task been killed while spilling data

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
    • Fix Version/s: 2.0.3, 2.1.2, 2.2.0
    • Component/s: Spark Core
    • Labels:
      None

      Description

      When a task is calling spill() but it receives a killing request from driver (e.g., speculative task), the TaskMemoryManager will throw an OOM exception.
      Then the executor takes it as UncaughtException, which will be handled by SparkUncaughtExceptionHandler and the executor will consequently be shutdown.
      However, this error may lead to the whole application failure due to the "max number of executor failures (30) reached".
      In our production environment, we have encountered a lot of such cases.

      17/04/05 06:41:27 INFO sort.UnsafeExternalSorter: Thread 115 spilling sort data of 928.0 MB to disk (1 time so far)
      17/04/05 06:41:27 INFO sort.UnsafeSorterSpillWriter: Spill file:/data/usercache/application_1482394966158_87487271/blockmgr-85c25fa8-06b4/32/temp_local_b731
      17/04/05 06:41:27 INFO sort.UnsafeSorterSpillWriter: Write numRecords:2097152
      17/04/05 06:41:30 INFO executor.Executor: Executor is trying to kill task 16.0 in stage 3.0 (TID 857)
      17/04/05 06:41:30 ERROR memory.TaskMemoryManager: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed
      java.nio.channels.ClosedByInterruptException
      	at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
      	at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
      	at org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten(DiskBlockObjectWriter.scala:228)
      	at org.apache.spark.storage.DiskBlockObjectWriter.recordWritten(DiskBlockObjectWriter.scala:207)
      	at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:139)
      	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:196)
      	at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:170)
      	at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244)
      	at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
      	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:302)
      	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:346)
              ................................................................
      17/04/05 06:41:30 INFO sort.UnsafeExternalSorter: Thread 115 spilling sort data of 928.0 MB to disk (2  times so far)
      17/04/05 06:41:30 INFO sort.UnsafeSorterSpillWriter: Spill file:/data/usercache/appcache/application_1482394966158_87487271/blockmgr-573312a3-bd46-4c5c-9293-1021cc34c77
      17/04/05 06:41:30 INFO sort.UnsafeSorterSpillWriter: Write numRecords:2097152
      17/04/05 06:41:31 ERROR memory.TaskMemoryManager: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed
      java.nio.channels.ClosedByInterruptException
      	at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
      	at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
      	at org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten(DiskBlockObjectWriter.scala:228)
      	at org.apache.spark.storage.DiskBlockObjectWriter.recordWritten(DiskBlockObjectWriter.scala:207)
      	at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:139)
      	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:196)
      	at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:170)
      	at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244)
      	at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
              .........................................................................
      17/04/05 06:41:31 WARN memory.TaskMemoryManager: leak 32.0 KB memory from org.apache.spark.shuffle.sort.ShuffleExternalSorter@513661a6
      17/04/05 06:41:31 ERROR executor.Executor: Managed memory leak detected; size = 26010016 bytes, TID = 857
      17/04/05 06:41:31 ERROR executor.Executor: Exception in task 16.0 in stage 3.0 (TID 857)
      java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed : null
      	at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:178)
      	at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244)
      	at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
              ......................................................
      17/04/05 06:41:31 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
      java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed : null
      	at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:178)
      	at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244)
      	at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
      	at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:164)
      	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:358)
      	..........................................
      17/04/05 06:41:31 INFO storage.DiskBlockManager: Shutdown hook called
      17/04/05 06:41:31 INFO util.ShutdownHookManager: Shutdown hook called
      

        Attachments

          Activity

            People

            • Assignee:
              coneyliu Xianyang Liu
              Reporter:
              wellfengzhu Feng Zhu
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: