[SPARK-17113] Job failure due to Executor OOM in offheap mode


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.2, 2.0.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: Spark Core
    • Labels: None

    Description

      We have been seeing many job failures due to executor OOM, with the following stack trace:

      java.lang.OutOfMemoryError: Unable to acquire 1220 bytes of memory, got 0
      	at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
      	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:341)
      	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:362)
      	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:93)
      	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:170)
      	at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
      	at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
      	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:736)
      	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:736)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:271)
      	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:271)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:271)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:271)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
      	at org.apache.spark.scheduler.Task.run(Task.scala:89)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      

      Digging into the code, we found that this is an issue with cooperative memory management for off-heap memory allocation.

      In the code at https://github.com/sitalkedia/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L463, when UnsafeExternalSorter checks whether a memory page is still in use by the upstream iterator, the base object of an off-heap page is always null, so UnsafeExternalSorter never spills those memory pages.
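      To make the failure mode concrete, the following is a minimal, standalone sketch (not the actual Spark source; the Page class and its fields are simplified assumptions) of why a base-object comparison cannot distinguish off-heap pages, while comparing the pages themselves can:

      import java.util.Arrays;
      import java.util.List;

      public class OffHeapSpillSketch {

          // Stand-in for a sorter memory page; simplified for illustration only.
          static final class Page {
              final Object baseObject; // always null when the page is allocated off-heap
              final int pageNumber;

              Page(Object baseObject, int pageNumber) {
                  this.baseObject = baseObject;
                  this.pageNumber = pageNumber;
              }
          }

          public static void main(String[] args) {
              // In off-heap mode every allocated page has a null base object.
              List<Page> allocatedPages =
                  Arrays.asList(new Page(null, 0), new Page(null, 1), new Page(null, 2));
              Page currentUpstreamPage = allocatedPages.get(2); // the only page still in use

              // Base-object comparison: distinguishes pages on-heap, but off-heap this is
              // always null != null, i.e. false, so no page is ever considered spillable.
              long freedByBaseObject = allocatedPages.stream()
                  .filter(p -> p.baseObject != currentUpstreamPage.baseObject)
                  .count();
              System.out.println("freed with base-object check: " + freedByBaseObject); // 0

              // Comparing page identity (or page numbers) behaves the same in both modes
              // and correctly identifies the two unused pages as spillable.
              long freedByIdentity = allocatedPages.stream()
                  .filter(p -> p != currentUpstreamPage)
                  .count();
              System.out.println("freed with page-identity check: " + freedByIdentity); // 2
          }
      }

      The second check above only illustrates the general direction of a fix (identify the page still in use by something other than its base object); it is not the exact code change made in Spark.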

People

    Assignee: Sital Kedia (sitalkedia@gmail.com)
    Reporter: Sital Kedia (sitalkedia@gmail.com)
    Votes: 0
    Watchers: 4
