SPARK-22458

OutOfDirectMemoryError with Spark 2.2


Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Not A Problem
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Shuffle, Spark Core, SQL, YARN
    • Labels: None

    Description

      We have been using Spark 2.1 for the last 6 months to run multiple Spark jobs, each running around 15 hours over 50+ TB of source data, successfully with the configuration below.

      spark.master yarn
      spark.driver.cores 10
      spark.driver.maxResultSize 5g
      spark.driver.memory 20g
      spark.executor.cores 5
      spark.executor.extraJavaOptions -XX:+UseG1GC -Dio.netty.maxDirectMemory=1024 -XX:MaxGCPauseMillis=60000 -XX:MaxDirectMemorySize=2048m -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
      spark.driver.extraJavaOptions -Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
      spark.executor.instances 30
      spark.executor.memory 30g
      spark.kryoserializer.buffer.max 512m

      spark.network.timeout 12000s
      spark.serializer org.apache.spark.serializer.KryoSerializer
      spark.shuffle.io.preferDirectBufs false
      spark.sql.catalogImplementation hive
      spark.sql.shuffle.partitions 5000
      spark.yarn.driver.memoryOverhead 1536
      spark.yarn.executor.memoryOverhead 4096
      spark.core.connection.ack.wait.timeout 600s
      spark.scheduler.maxRegisteredResourcesWaitingTime 15s
      spark.sql.hive.filesourcePartitionFileCacheSize 524288000

      spark.dynamicAllocation.executorIdleTimeout 30000s
      spark.dynamicAllocation.enabled true
      spark.hadoop.yarn.timeline-service.enabled false
      spark.shuffle.service.enabled true
      spark.yarn.am.extraJavaOptions -Dhdp.version=2.5.3.0-37 -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m
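
      For readability, these are the settings from the list above that govern direct / off-heap memory on the executors. The comments are my own short summary of each knob's documented behaviour, not part of the original configuration:

      # JVM cap on NIO direct buffer allocations (2 GiB per executor JVM),
      # set via spark.executor.extraJavaOptions:
      -XX:MaxDirectMemorySize=2048m
      # Netty's own cap on the direct memory it tracks for its buffer pools,
      # also set via spark.executor.extraJavaOptions:
      -Dio.netty.maxDirectMemory=1024
      # Asks Spark's Netty transport to prefer on-heap buffers for shuffle I/O:
      spark.shuffle.io.preferDirectBufs false
      # Off-heap headroom (MiB) added on top of the 30g executor heap when
      # sizing the YARN container:
      spark.yarn.executor.memoryOverhead 4096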

      Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to pick up some fixes in the latest version. After the upgrade we started hitting a direct-buffer OutOfMemoryError, as well as executors being killed for exceeding the memoryOverhead limit. We have tried tweaking multiple properties to fix this, but the issue persists. The relevant information is shared below.

      Please let me know if any other details are required.

      Snapshot of the DirectMemory error stack trace:

      10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 5.3 (TID 25022, dedwdprshc070.de.xxxxxxx.com, executor 615): FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxxxxxx.com, 7337, None), shuffleId=7, mapId=141, reduceId=3372, message=
      org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824)
              at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
              at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
              at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
              at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
              at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
              at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
              at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
              at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
              at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
              at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
              at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
              at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
              at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
              at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source)
              at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
              at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
              at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414)
              at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
              at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
              at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
              at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
              at org.apache.spark.scheduler.Task.run(Task.scala:108)
              at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:748)
      Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824)
              at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:530)
              at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:484)
              at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.allocateDirect(UnpooledUnsafeNoCleanerDirectByteBuf.java:30)
              at io.netty.buffer.UnpooledUnsafeDirectByteBuf.<init>(UnpooledUnsafeDirectByteBuf.java:67)
              at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.<init>(UnpooledUnsafeNoCleanerDirectByteBuf.java:25)
              at io.netty.buffer.UnsafeByteBufUtil.newUnsafeDirectByteBuf(UnsafeByteBufUtil.java:425)
              at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:299)
              at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
              at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
              at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
              at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
              at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
              at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
              at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
              at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
              at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
              at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
              at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
              ... 1 more
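
      For reference, the figures in the Netty error above work out as follows (this is just a unit conversion of the numbers in the log line, with no additional assumptions):

      max direct memory (Netty) : 1,073,741,824 bytes = 1 GiB
      already used              : 1,073,699,840 bytes
      remaining headroom        : 1,073,741,824 - 1,073,699,840 = 41,984 bytes (~41 KiB)
      requested allocation      :        65,536 bytes (64 KiB)  -> does not fit, so the fetch fails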
      

      If I remove the Netty configuration above, I get the error below.

      Snapshot of the exceeding-memory-overhead stack trace:

      Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3372 in stage 5.0 failed 4 times, most recent failure: Lost task 3372.3 in stage 5.0 (TID 19534, dedwfprshd006.de.xxxxxxx.com, executor 125): ExecutorLostFailure (executor 125 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 37.1 GB of 34 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
      Driver stacktrace:
              at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
              at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
              at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
              at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
              at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
              at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
              at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
              at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
              at scala.Option.foreach(Option.scala:257)
              at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
              at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
              at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
              at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
              at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
              at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
              at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
              at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:188)
              ... 49 more
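
      The 34 GB limit reported by YARN matches the container size implied by the configuration above (executor heap plus spark.yarn.executor.memoryOverhead). The arithmetic below simply restates those settings; the last line is only an illustrative bump in the direction the error message suggests, with a placeholder value rather than a verified fix:

      YARN container limit = spark.executor.memory + spark.yarn.executor.memoryOverhead
                           = 30720 MB (30g)       + 4096 MB
                           = 34816 MB  (~34 GB, the limit in the error message)
      physical memory used = 37.1 GB  -> container killed by YARN

      # Illustrative only (placeholder value, not a verified fix):
      spark.yarn.executor.memoryOverhead 6144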
      


People

    • Assignee: Unassigned
    • Reporter: Kaushal Prajapati (skp33)
    • Votes: 0
    • Watchers: 1
