SPARK-18458

core dumped running Spark SQL on large data volume (100TB)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.1.0
    • Component/s: SQL
    • Labels:
    • Target Version/s:

      Description

      Running a query against a 100TB Parquet database with the Spark master build dated 11/04 causes the Spark executors to dump core.

      The query is TPC-DS query 82. It is not the only query that produces this core dump, just the easiest one for re-creating the error.

      Spark output that showed the exception:

      16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e68_1478924651089_0018_01_000074 on host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from container-launch.
      Container id: container_e68_1478924651089_0018_01_000074
      Exit code: 134
      Exception message: /bin/bash: line 1: 4031216 Aborted                 (core dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 --hostname mer05x.svl.ibm.com --cores 2 --app-id application_1478924651089_0018 --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/__app__.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.databricks_spark-csv_2.10-1.3.0.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/org.apache.commons_commons-csv-1.1.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.univocity_univocity-parsers-1.5.1.jar > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stdout 2> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stderr
      
      Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 Aborted                 (core dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 --hostname mer05x.svl.ibm.com --cores 2 --app-id application_1478924651089_0018 --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/__app__.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.databricks_spark-csv_2.10-1.3.0.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/org.apache.commons_commons-csv-1.1.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.univocity_univocity-parsers-1.5.1.jar > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stdout 2> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stderr
      
              at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
              at org.apache.hadoop.util.Shell.run(Shell.java:456)
              at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
              at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      
      
      Container exited with a non-zero exit code 134
      
      

      According to the source code, exit code 134 is 128+6, and signal 6 is SIGABRT ("Abort signal from abort(3)", whose default action is to dump core). The executors were killed by that signal.
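
The 128+N arithmetic above can be sketched as a small helper. This is only an illustration: the class and method names are hypothetical, the signal table is a small subset of the Linux x86-64 numbers from signal(7), and nothing here is part of Spark or YARN.

```java
class ExitStatusSketch {
    // Hypothetical helper: shells report "killed by signal N" as exit code 128 + N.
    // Returns the signal name for such exit codes, or "none" for ordinary exits.
    static String signalFor(int exitCode) {
        if (exitCode <= 128 || exitCode > 128 + 64) {
            return "none";
        }
        int signal = exitCode - 128;
        switch (signal) {
            case 6:  return "SIGABRT";  // abort(3)
            case 9:  return "SIGKILL";
            case 11: return "SIGSEGV";  // invalid memory reference
            case 15: return "SIGTERM";
            default: return "signal " + signal;
        }
    }

    public static void main(String[] args) {
        // 134 is the container exit code reported in the log above.
        System.out.println("exit 134 -> " + signalFor(134));
    }
}
```

Applied to the container exit code in the log, the helper maps 134 to SIGABRT, which matches the "Aborted (core dumped)" message from bash.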

      On the YARN side, the log is clearer:

      #
      # A fatal error has been detected by the Java Runtime Environment:
      #
      #  SIGSEGV (0xb) at pc=0x00007fffe29e6bac, pid=3694385, tid=140735430203136
      #
      # JRE version: OpenJDK Runtime Environment (8.0_77-b03) (build 1.8.0_77-b03)
      # Java VM: OpenJDK 64-Bit Server VM (25.77-b03 mixed mode linux-amd64 compressed oops)
      # Problematic frame:
      # J 10342% C2 org.apache.spark.util.collection.unsafe.sort.RadixSort.sortKeyPrefixArrayAtByte(Lorg/apache/spark/unsafe/array/LongArray;I[JIIIZZ)V (386 bytes) @ 0x00007fffe29e6bac [0x00007fffe29e43c0+0x27ec]
      #
      # Core dump written. Default location: /data2/hadoop/yarn/local/usercache/spark/appcache/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/core or core.3694385
      #
      # An error report file with more information is saved as:
      # /data2/hadoop/yarn/local/usercache/spark/appcache/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/hs_err_pid3694385.log
      #
      # If you would like to submit a bug report, please visit:
      #   http://bugreport.java.com/bugreport/crash.jsp
      #
      

      The hs_err_pid3694385.log file shows the stack:

      Stack: [0x00007fff85432000,0x00007fff85533000],  sp=0x00007fff85530ce0,  free space=1019k
      Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
      J 3896 C1 org.apache.spark.unsafe.Platform.putLong(Ljava/lang/Object;JJ)V (10 bytes) @ 0x00007fffe1d3cdec [0x00007fffe1d3cde0+0xc]
      j  org.apache.spark.util.collection.unsafe.sort.RadixSort.sortKeyPrefixArrayAtByte(Lorg/apache/spark/unsafe/array/LongArray;I[JIIIZZ)V+138
      j  org.apache.spark.util.collection.unsafe.sort.RadixSort.sortKeyPrefixArray(Lorg/apache/spark/unsafe/array/LongArray;IIIIZZ)I+209
      j  org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator()Lorg/apache/spark/util/collection/unsafe/sort/UnsafeSorterIterator;+56
      j  org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator()Lorg/apache/spark/util/collection/unsafe/sort/UnsafeSorterIterator;+62
      j  org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort()Lscala/collection/Iterator;+4
      j  org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext()V+24
      j  org.apache.spark.sql.execution.BufferedRowIterator.hasNext()Z+11
      j  org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext()Z+4
      j  org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;Lscala/collection/Iterator;Lscala/collection/Iterator;)Z+147
      j  org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V+552
      j  org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext()V+17
      J 3849 C1 org.apache.spark.sql.execution.BufferedRowIterator.hasNext()Z (30 bytes) @ 0x00007fffe1d5679c [0x00007fffe1d56520+0x27c]
      j  org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext()Z+4
      j  scala.collection.Iterator$$anon$11.hasNext()Z+4
      j  org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(Lscala/collection/Iterator;)V+3
      j  org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Lorg/apache/spark/scheduler/MapStatus;+222
      j  org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+2
      j  org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;)Ljava/lang/Object;+152
      j  org.apache.spark.executor.Executor$TaskRunner.run()V+423
      j  java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
      j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
      j  java.lang.Thread.run()V+11
      v  ~StubRoutines::call_stub
      V  [libjvm.so+0x63d6ba]
      V  [libjvm.so+0x63ab74]
      V  [libjvm.so+0x63b189]
      V  [libjvm.so+0x67e6a1]
      V  [libjvm.so+0x9b3f5a]
      V  [libjvm.so+0x869722]
      C  [libpthread.so.0+0x7dc5]  start_thread+0xc5
      

      This is not easily reproducible on smaller data volumes (e.g., 1TB or 10TB) but is easily reproducible at 100TB and above, so the investigation should look at data types that may not be large enough to hold values in the hundreds of billions.
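
As an illustration of the kind of data-type problem suspected above, the sketch below shows how a byte offset computed with 32-bit `int` arithmetic wraps to a negative value once the element index gets large enough, while the same computation in `long` does not. The stack trace shows the crash inside `Platform.putLong`, which writes at a caller-supplied `long` offset, so a wrapped offset is one plausible way to get a SIGSEGV. This is not a confirmed diagnosis. The class, the method names, and the base offset of 16 are hypothetical and are not taken from the Spark source.

```java
class OffsetOverflowSketch {
    // Assumption for illustration: each element is an 8-byte long.

    // Buggy form: index * 8 is evaluated in 32-bit int arithmetic and can wrap
    // before the result is widened to long for the addition.
    static long byteOffsetInt(long baseOffset, int index) {
        return baseOffset + index * 8;
    }

    // Safe form: the 8L literal forces the multiplication into 64-bit arithmetic.
    static long byteOffsetLong(long baseOffset, int index) {
        return baseOffset + index * 8L;
    }

    public static void main(String[] args) {
        int index = 300_000_000;  // 300M elements * 8 bytes exceeds Integer.MAX_VALUE
        System.out.println("int arithmetic:  " + byteOffsetInt(16L, index));
        System.out.println("long arithmetic: " + byteOffsetLong(16L, index));
    }
}
```

An overflow of this kind stays hidden while indexes are small, which would be consistent with a failure that only shows up at the largest data volumes.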

              People

              • Assignee: kiszk Kazuaki Ishizaki
              • Reporter: jfchen@us.ibm.com JESSE CHEN