Hadoop YARN / YARN-7052

RM SchedulingMonitor gives no indication why the spawned thread crashed.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0-beta1, 2.8.2
    • Component/s: yarn
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      In YARN-7051, we ran into a case where the preemption monitor thread hung with no indication of why.

      The preemption monitor is started by the SchedulingExecutorService from SchedulingMonitor#serviceStart. Once an uncaught throwable happens, nothing ever gets the result of the future, the thread running the preemption monitor never dies, and it never gets rescheduled.

      If HadoopExecutors were used, it would at least provide a HadoopScheduledThreadPoolExecutor that logs the exception if one happens.
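      The failure mode described above can be reproduced with the plain JDK. The following is a minimal sketch, not the SchedulingMonitor code: the class name SilentFailureDemo and the simulated failure are assumptions made for illustration. It relies on the documented behavior of ScheduledExecutorService#scheduleAtFixedRate, which suppresses all subsequent executions once one execution throws, and reports the exception only to a caller that asks the returned future for its result.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; it only mirrors the scheduling pattern quoted in this issue.
class SilentFailureDemo {

  /** Runs a periodic task that throws on its second run; returns the total number of runs. */
  static int runUntilFailure() {
    ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
    AtomicInteger runs = new AtomicInteger();

    ScheduledFuture<?> handler = ses.scheduleAtFixedRate(() -> {
      if (runs.incrementAndGet() == 2) {
        // Stand-in for any uncaught throwable escaping the periodic task.
        throw new IllegalStateException("simulated checker failure");
      }
    }, 0, 10, TimeUnit.MILLISECONDS);

    try {
      // Asking the future for its result is the only way to see the failure.
      // Without this call the exception is stored in the future and nothing is logged.
      handler.get();
    } catch (ExecutionException e) {
      System.out.println("Periodic task failed: " + e.getCause());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      ses.shutdownNow();
    }
    // The task is never rescheduled after the failing run.
    return runs.get();
  }

  public static void main(String[] args) {
    System.out.println("Runs before suppression: " + runUntilFailure());
  }
}
```

      In this sketch the worker thread stays alive and idle after the failure, which matches the symptom reported here: the monitor appears hung rather than crashed.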

        Activity

        eepayne Eric Payne added a comment -

        Using HadoopExecutors may not be feasible.

        The following is used to launch the preemption thread:

        SchedulingMonitor#serviceStart
            ses = Executors.newSingleThreadScheduledExecutor(new ThreadFactory() {
        ...
            handler = ses.scheduleAtFixedRate(new PreemptionChecker(),
                0, monitorInterval, TimeUnit.MILLISECONDS);
        

        HadoopExecutors provides a newSingleThreadScheduledExecutor method, but it simply delegates to Executors#newSingleThreadScheduledExecutor. The executor it returns is not wrapped in a HadoopScheduledThreadPoolExecutor, so using HadoopExecutors#newSingleThreadScheduledExecutor does not provide the logging benefits.
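        For context, the kind of logging a wrapper executor can provide is sketched below using only JDK classes. This is an illustration of the general afterExecute technique described in the ThreadPoolExecutor documentation, not the HadoopScheduledThreadPoolExecutor source; the class name LoggingScheduledExecutor and its fields are hypothetical.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical executor that reports failures of scheduled tasks.
class LoggingScheduledExecutor extends ScheduledThreadPoolExecutor {
  final AtomicReference<Throwable> lastFailure = new AtomicReference<>();
  final CountDownLatch failureSeen = new CountDownLatch(1);

  LoggingScheduledExecutor(int corePoolSize) {
    super(corePoolSize);
  }

  @Override
  protected void afterExecute(Runnable r, Throwable t) {
    super.afterExecute(r, t);
    Throwable failure = t;
    // Scheduled tasks are wrapped in futures, so the exception is stored in
    // the future instead of being passed in as t. A periodic task is "done"
    // only once it has failed or been cancelled.
    if (failure == null && r instanceof Future<?> && ((Future<?>) r).isDone()) {
      try {
        ((Future<?>) r).get();
      } catch (ExecutionException e) {
        failure = e.getCause();
      } catch (CancellationException e) {
        // Cancellation is not treated as a failure in this sketch.
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    if (failure != null) {
      // A real service would use its logging framework here.
      System.err.println("Scheduled task failed: " + failure);
      lastFailure.set(failure);
      failureSeen.countDown();
    }
  }

  /** Schedules a task that always throws and returns the failure that was reported. */
  static Throwable demo() {
    LoggingScheduledExecutor ses = new LoggingScheduledExecutor(1);
    ses.scheduleAtFixedRate(() -> {
      throw new IllegalStateException("simulated checker failure");
    }, 0, 10, TimeUnit.MILLISECONDS);
    try {
      ses.failureSeen.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      ses.shutdownNow();
    }
    return ses.lastFailure.get();
  }
}
```

        Note that this approach only makes the failure visible. The periodic task is still not rescheduled after it throws.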

        Alternatively, we could have the thread itself catch and handle throwables.

        The thread being launched by SchedulingMonitor#serviceStart is calling PreemptionChecker#run, which only handles YarnRuntimeException. Anything else will cause the thread to hang and not get rescheduled.

        I suggest that another solution would be to handle other throwables, log them, and either re-throw or cancel the thread.

        eepayne Eric Payne added a comment -

        "I suggest that another solution would be to handle other throwables, log them, and either re-throw or cancel the thread."

        After an offline discussion with Jason Lowe, I think it would be better to catch throwables, log them, and skip the invocation. Preemption does not keep persistent structures across invocations, and it does not modify any existing leaf queue structures.

        Since preemption can be an important productivity feature for certain use cases, I am marking this critical for 2.8.2.
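        The catch, log, and skip approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the committed patch: GuardedChecker and the policy Runnable are hypothetical stand-ins for PreemptionChecker and the work it invokes, and System.err stands in for the project's logger.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wrapper: catches every throwable so the executor keeps rescheduling the task.
class GuardedChecker implements Runnable {
  private final Runnable policy; // stand-in for the periodic preemption work
  final AtomicInteger failures = new AtomicInteger();

  GuardedChecker(Runnable policy) {
    this.policy = policy;
  }

  @Override
  public void run() {
    try {
      policy.run();
    } catch (Throwable t) {
      // Log and skip this invocation; the next one still runs on schedule.
      failures.incrementAndGet();
      System.err.println("Exception raised while running checker, skipping this run: " + t);
    }
  }

  /** Returns {invocations, failures} after at least four invocations have started. */
  static int[] demo() {
    AtomicInteger invocations = new AtomicInteger();
    CountDownLatch fourRuns = new CountDownLatch(4);
    GuardedChecker checker = new GuardedChecker(() -> {
      int n = invocations.incrementAndGet();
      fourRuns.countDown();
      if (n == 2) {
        throw new IllegalStateException("simulated checker failure");
      }
    });

    ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
    ses.scheduleAtFixedRate(checker, 0, 10, TimeUnit.MILLISECONDS);
    try {
      fourRuns.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      ses.shutdownNow();
    }
    return new int[] {invocations.get(), checker.failures.get()};
  }
}
```

        Skipping a failed run is reasonable here only because, as noted above, each invocation starts from fresh state. A task that carried state between runs would need to consider whether a partial run left that state inconsistent.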

        hadoopqa Hadoop QA added a comment -
        -1 overall



        || Vote || Subsystem || Runtime || Comment ||
        | 0 | reexec | 1m 22s | Docker mode activated. |
        || || || || Prechecks ||
        | +1 | @author | 0m 0s | The patch does not contain any @author tags. |
        | -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
        || || || || trunk Compile Tests ||
        | +1 | mvninstall | 16m 4s | trunk passed |
        | +1 | compile | 0m 34s | trunk passed |
        | +1 | checkstyle | 0m 26s | trunk passed |
        | +1 | mvnsite | 0m 39s | trunk passed |
        | +1 | findbugs | 1m 4s | trunk passed |
        | +1 | javadoc | 0m 23s | trunk passed |
        || || || || Patch Compile Tests ||
        | +1 | mvninstall | 0m 34s | the patch passed |
        | +1 | compile | 0m 34s | the patch passed |
        | +1 | javac | 0m 34s | the patch passed |
        | -0 | checkstyle | 0m 23s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) |
        | +1 | mvnsite | 0m 35s | the patch passed |
        | +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
        | +1 | findbugs | 1m 11s | the patch passed |
        | +1 | javadoc | 0m 20s | the patch passed |
        || || || || Other Tests ||
        | -1 | unit | 49m 52s | hadoop-yarn-server-resourcemanager in the patch failed. |
        | +1 | asflicense | 0m 19s | The patch does not generate ASF License warnings. |
        | | | 75m 39s | |



        || Reason || Tests ||
        | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation |
        | Timed out junit tests | org.apache.hadoop.yarn.server.resourcemanager.TestRMStoreCommands |
        | | org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA |
        | | org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA |
        | | org.apache.hadoop.yarn.server.resourcemanager.TestRMHAForNodeLabels |



        || Subsystem || Report/Notes ||
        | Docker | Image:yetus/hadoop:14b5c93 |
        | JIRA Issue | YARN-7052 |
        | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12883601/YARN-7052.001.patch |
        | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
        | uname | Linux d5ea45e62c4a 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
        | Build tool | maven |
        | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
        | git revision | trunk / c2cb7ea |
        | Default Java | 1.8.0_144 |
        | findbugs | v3.1.0-RC1 |
        | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/17122/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt |
        | unit | https://builds.apache.org/job/PreCommit-YARN-Build/17122/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt |
        | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/17122/testReport/ |
        | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
        | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/17122/console |
        | Powered by | Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org |

        This message was automatically generated.

        eepayne Eric Payne added a comment -

        The following unit tests are all passing for me in my environment:

          org.apache.hadoop.yarn.server.resourcemanager.TestRMStoreCommands
          org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA
          org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA
          org.apache.hadoop.yarn.server.resourcemanager.TestRMHAForNodeLabels 
        

        The TestContainerAllocation unit test failure is the same one seen in YARN-7044.

        jlowe Jason Lowe added a comment -

        I'm a bit hesitant to catch Throwable instead of Exception when suppressing, but that's exactly what the thread pool executor is going to do as well.

        +1 lgtm. I'll commit this later today if no objections, cleaning up the unused import checkstyle nit during the process.

        jlowe Jason Lowe added a comment -

        Thanks, Eric! I committed this to trunk, branch-2, branch-2.8, and branch-2.8.2.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #12244 (See https://builds.apache.org/job/Hadoop-trunk-Commit/12244/)
        YARN-7052. RM SchedulingMonitor gives no indication why the spawned (jlowe: rev 39a9dc8e4a6e1d13658867ad756878d3dd6352b0)

        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/SchedulingMonitor.java

          People

          • Assignee:
            eepayne Eric Payne
          • Reporter:
            eepayne Eric Payne
          • Votes:
            0
          • Watchers:
            6
