
SPARK-29767: Core dump happening on executors while doing a simple union of DataFrames


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.4.4
    • Fix Version/s: None
    • Component/s: PySpark, Spark Core
    • Labels: None
    • Environment: AWS EMR 5.27.0, Spark 2.4.4

    Description

      Running a union operation on two DataFrames, through both the Scala spark-shell and PySpark, results in the executor containers doing a core dump and exiting with exit code 134.

      The trace from the Driver:

      Container exited with a non-zero exit code 134
      .
      19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; aborting job
      org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Container from a bad node: container_1572981097605_0021_01_000077 on host: ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from container-launch.
      Container id: container_1572981097605_0021_01_000077
      Exit code: 134
      Exception message: /bin/bash: line 1: 12611 Aborted                 LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_000077/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_000077 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id application_1572981097605_0021 --user-class-path file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_000077/__app__.jar > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_000077/stdout 2> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_000077/stderrStack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted                 LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_000077/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_000077 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id application_1572981097605_0021 --user-class-path file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_000077/__app__.jar > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_000077/stdout 2> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_000077/stderr	at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
      	at org.apache.hadoop.util.Shell.run(Shell.java:869)
      	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
      	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      Container exited with a non-zero exit code 134

      From the stdout logs of the exiting container we see:

      #
      # A fatal error has been detected by the Java Runtime Environment:
      #
      #  SIGSEGV (0xb) at pc=0x00007f825e3b0e92, pid=12611, tid=0x00007f822b5fb700
      #
      # JRE version: OpenJDK Runtime Environment (8.0_222-b10) (build 1.8.0_222-b10)
      # Java VM: OpenJDK 64-Bit Server VM (25.222-b10 mixed mode linux-amd64 compressed oops)
      # Problematic frame:
      # V  [libjvm.so+0xa9ae92]
      #
      # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
      #
      # An error report file with more information is saved as:
      # /mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_000077/hs_err_pid12611.log
      #
      # If you would like to submit a bug report, please visit:
      #   http://bugreport.java.com/bugreport/crash.jsp
      #

      Also, I am unable to get a core dump even though ulimit -c is set to unlimited. Can you help with how to go about this issue, and also with how to obtain the core dump?

      Steps to reproduce the issue:

      • Upload the attached Parquet data file to S3 at s3://<bucket>/tables/spark_29767_parquet_table/inserted_at=201910/
      • Create a partitioned Hive table:
      CREATE EXTERNAL TABLE `spark_29767_parquet_table`(
        `hour` bigint, 
        `title` string, 
        `__deleted` string, 
        `status` string, 
        `transformationid` string, 
        `roomid` string, 
        `day` bigint, 
        `notes` string, 
        `nunitsfromaudit` bigint, 
        `ts_ms` bigint, 
        `liability` string, 
        `_class` string, 
        `month` bigint, 
        `updatedate` struct<`date`:bigint>, 
        `_id` struct<oid:string>, 
        `year` bigint, 
        `item` struct<name:string,brandname:string,perunitpricefromaudit:struct<currency:string,amount:string>,actualPerUnitPrice:struct<currency:string,amount:string>,category:string,itemType:string,roomAmenityId:bigint>, 
        `createddate` struct<`date`:bigint>, 
        `actualunits` bigint, 
        `description` string)
      PARTITIONED BY ( 
        `inserted_at` string)
      ROW FORMAT SERDE 
        'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
      STORED AS INPUTFORMAT 
        'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
      OUTPUTFORMAT 
        'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
      LOCATION
        's3://<bucket>/tables/spark_29767_parquet_table'
      
      • Sync the partition:
      ALTER TABLE spark_29767_parquet_table ADD PARTITION (inserted_at='201910') location 's3://<bucket>/tables/spark_29767_parquet_table/inserted_at=201910/'
      
      • In PySpark, run the following (a few diagnostic variations are sketched after these steps):
      # Read the base DataFrame
      
      from pyspark import SparkContext, SparkConf
      from pyspark.sql import SparkSession, HiveContext
      from pyspark.sql.functions import lit
      sparkSession = (SparkSession
                      .builder
                      .appName('example-pyspark-read-and-write-from-hive')
                      .enableHiveSupport()
                      .getOrCreate())
      
      base_df = sparkSession.table("spark_29767_parquet_table")
      base_df = base_df.select("_id", "_class", "roomid", "item", "inserted_at")
      
      # Create a new DataFrame with one row for the union
      
      from pyspark.sql import *
      import pyspark.sql.types
      from pyspark.sql.types import *
      
      schema = StructType([
          StructField("_id", StructType([StructField("oid", StringType(), True)]), True),
          StructField("_class", StringType(), True),
          StructField("roomid", StringType(), True),
          StructField("item", StructType([
              StructField("name", StringType(), True),
              StructField("brandname", StringType(), True),
              StructField("perunitpricefromaudit", StructType([
                  StructField("currency", StringType(), True),
                  StructField("amount", StringType(), True)]), True),
              StructField("actualperunitprice", StructType([
                  StructField("currency", StringType(), True),
                  StructField("amount", StringType(), True)]), True),
              StructField("category", StringType(), True),
              StructField("itemtype", StringType(), True),
              StructField("roomamenityid", LongType(), True)]), True),
          StructField("inserted_at", StringType(), True)])
      
      data = [
          Row(Row("5daff5ca43b8a36756c23b0f"),
              "com.oyo.transformations.tasks.model.implementations.AuditItemTaskImpl",
              None,
              Row("Geyser Installation(with accessories)", None, Row("INR", "425.0"), None, "INFRASTRUCTURE", "PMC", None),
              "201910")
      ]
      
      inc_df = sparkSession.createDataFrame(
          sparkSession.sparkContext.parallelize(data),
          schema
      )
      
      inc_df.union(base_df).show()
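
      As a diagnostic variation (a sketch, not verified against this environment), the same union can also be attempted with unionByName, which resolves columns by name rather than by position (available since Spark 2.3). If this path crashes the executors as well, column ordering between the two DataFrames can be ruled out as a factor.

      # Variation: match columns by name instead of position.
      # If this also crashes, column ordering is not the trigger.
      inc_df.unionByName(base_df).show()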
      
      
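      Since the fatal error above is a SIGSEGV inside libjvm rather than a Java-level exception, another way to narrow it down is to toggle the vectorized Parquet reader and whole-stage code generation before re-running the union. This is a sketch of a diagnostic step, not a confirmed fix; the two settings are only assumptions about where the crash might originate.

      # Diagnostic toggles: turn off the vectorized Parquet reader and
      # whole-stage codegen, then re-run the union to see which execution
      # path the executor JVM actually crashes in.
      sparkSession.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
      sparkSession.conf.set("spark.sql.codegen.wholeStage", "false")
      inc_df.union(base_df).show()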

       
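      It may also help to check whether the crash is tied to the Hive table read path at all. The sketch below, which assumes the attached Parquet file's column names match the table schema, reads the partition directory directly instead of going through the metastore table; the partition column is cast to string so the schema lines up with inc_df.

      from pyspark.sql.functions import col

      # Read the partition directory directly, bypassing the Hive table.
      # basePath makes Spark treat inserted_at as a partition column.
      direct_df = (sparkSession.read
                   .option("basePath", "s3://<bucket>/tables/spark_29767_parquet_table/")
                   .parquet("s3://<bucket>/tables/spark_29767_parquet_table/inserted_at=201910/")
                   .select("_id", "_class", "roomid", "item",
                           col("inserted_at").cast("string").alias("inserted_at")))

      inc_df.union(direct_df).show()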

      Attachments

        1. test.py
          2 kB
          Xiao Han
        2. coredump.zip
          46.54 MB
          Udit Mehrotra
        3. hs_err_pid13885.log
          322 kB
          Udit Mehrotra
        4. part-00000-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet
          10 kB
          Udit Mehrotra

          People

            Assignee: Unassigned
            Reporter: Udit Mehrotra (uditme)
            Votes: 0
            Watchers: 5
