FLINK-5383

TaskManager fails with SIGBUS when loading RocksDB

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.3.0
    • Component/s: None
    • Labels: None

      Description

      While trying out Flink 1.2, my TaskManager died with the following error while deploying a job:

      2016-12-21 15:57:50,080 INFO  org.apache.flink.runtime.taskmanager.Task                     - Map -> Sink: Unnamed (15/16) (50f527e4445479fb1fc9f34394d86d2f) switched from DEPLOYING to RUNNING.
      2016-12-21 15:57:50,081 INFO  org.apache.flink.runtime.taskmanager.Task                     - Map -> Sink: Unnamed (16/16) (b4b3d3340de587d729fe83d65eac3e10) switched from DEPLOYING to RUNNING.
      2016-12-21 15:57:50,081 INFO  org.apache.flink.streaming.runtime.tasks.StreamTask           - Using user-defined state backend: RocksDB State Backend {isInitialized=false, configuredDbBasePaths=null, initializedDbBasePaths=null, checkpointStreamBackend=File State Backend @ hdfs://nameservice1/shared/checkpoint-dir-rocks}.
      2016-12-21 15:57:50,081 INFO  org.apache.flink.streaming.runtime.tasks.StreamTask           - Using user-defined state backend: RocksDB State Backend {isInitialized=false, configuredDbBasePaths=null, initializedDbBasePaths=null, checkpointStreamBackend=File State Backend @ hdfs://nameservice1/shared/checkpoint-dir-rocks}.
      2016-12-21 15:57:50,223 INFO  org.apache.flink.contrib.streaming.state.RocksDBStateBackend  - Attempting to load RocksDB native library and store it at '/yarn/nm/usercache/longrunning/appcache/application_1482156101125_0016'
      
      LogType:taskmanager.out
      Log Upload Time:Wed Dec 21 16:00:35 +0000 2016
      LogLength:959
      Log Contents:
      #
      # A fatal error has been detected by the Java Runtime Environment:
      #
      #  SIGBUS (0x7) at pc=0x00007fe745fd596a, pid=7414, tid=140630801725184
      #
      # JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build 1.7.0_67-b01)
      # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode linux-amd64 compressed oops)
      # Problematic frame:
      # C  [ld-linux-x86-64.so.2+0x1a96a]  realloc+0x2bfa
      #
      

      The error report file contained the following frames:

      Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
      j  java.lang.ClassLoader$NativeLibrary.load(Ljava/lang/String;)V+0
      j  java.lang.ClassLoader.loadLibrary1(Ljava/lang/Class;Ljava/io/File;)Z+302
      j  java.lang.ClassLoader.loadLibrary0(Ljava/lang/Class;Ljava/io/File;)Z+2
      j  java.lang.ClassLoader.loadLibrary(Ljava/lang/Class;Ljava/lang/String;Z)V+48
      j  java.lang.Runtime.load0(Ljava/lang/Class;Ljava/lang/String;)V+57
      j  java.lang.System.load(Ljava/lang/String;)V+7
      j  org.rocksdb.NativeLibraryLoader.loadLibraryFromJar(Ljava/lang/String;)V+14
      j  org.rocksdb.NativeLibraryLoader.loadLibrary(Ljava/lang/String;)V+22
      j  org.apache.flink.contrib.streaming.state.RocksDBStateBackend.ensureRocksDBIsLoaded(Ljava/lang/String;)V+62
      j  org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(Lorg/apache/flink/runtime/execution/Environment;Lorg/apache/flink/api/common/JobID;Ljava/lang/String;Lorg/apache/flink/api/common/typeutils/TypeSerializer;ILorg/apache/flink/runtime/state/KeyGroupRange;Lorg/apache/flink/runtime/query/TaskKvStateRegistry;)Lorg/apache/flink/runtime/state/AbstractKeyedStateBackend;+16
      j  org.apache.flink.streaming.runtime.tasks.StreamTask.createKeyedStateBackend(Lorg/apache/flink/api/common/typeutils/TypeSerializer;ILorg/apache/flink/runtime/state/KeyGroupRange;)Lorg/apache/flink/runtime/state/AbstractKeyedStateBackend;+137
      

      I have seen this error only once so far. I'll report again if it happens more frequently.

        Activity

        StephanEwen Stephan Ewen added a comment -

        Reading up on SIGBUS: it seems to occur when a file is memory-mapped (here, the JNI library) and some other process then truncates that file.
        That is probably caused by concurrent overwrites of the library file.

        I would expect this to be fixed as a side effect of https://github.com/apache/flink/commit/3070ff9a6d9de47a4713d4b4952929f8c00043b1
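
        To illustrate the failure mode described above, here is a minimal sketch of one general way to avoid overwriting a library file that another thread or process may already have mapped: every extraction goes into its own freshly created, uniquely named directory. This is an assumption-laden illustration, not the content of the linked commit; the class name NativeLibExtractor, the method extractToUniqueDir, and the "native-lib-" directory prefix are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

/** Hypothetical helper; not part of Flink or RocksDB. */
public final class NativeLibExtractor {

    private NativeLibExtractor() {}

    /**
     * Copies the given library bytes into a new, uniquely named
     * subdirectory of baseDir and returns the path of the copy.
     * Because the directory name is random and created fresh, two
     * concurrent callers never write to the same file, so a file that
     * is already mapped by a loader is never truncated by another copy.
     */
    public static Path extractToUniqueDir(InputStream libraryBytes, Path baseDir, String fileName)
            throws IOException {
        Files.createDirectories(baseDir);

        // createDirectory (singular) throws if the directory already exists,
        // which guards against the unlikely case of a name collision.
        Path uniqueDir = baseDir.resolve("native-lib-" + UUID.randomUUID());
        Files.createDirectory(uniqueDir);

        // Files.copy without REPLACE_EXISTING refuses to overwrite a file.
        Path target = uniqueDir.resolve(fileName);
        Files.copy(libraryBytes, target);
        return target;
    }
}
```

        A caller would then pass the returned path to System.load(path.toAbsolutePath().toString()). The trade-off is that each extraction leaves its own copy on disk, so some cleanup of old directories is needed.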

        rmetzger Robert Metzger added a comment -

        I can confirm that the mentioned commit fixes the issue! I'm closing the JIRA.


          People

          • Assignee:
            StephanEwen Stephan Ewen
            Reporter:
            rmetzger Robert Metzger
          • Votes:
            0
            Watchers:
            2
