Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-6451

Add RM monitor validating metrics invariants

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.0-alpha4
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      For SLS runs, as well as for live test clusters (and maybe prod), it would be useful to have a mechanism to continuously check whether core invariants of the RM/Scheduler are respected (e.g., no priority inversions, fairness mostly respected, certain latencies within expected range, etc..)

      1. YARN-6451.v0.patch
        8 kB
        Carlo Curino
      2. YARN-6451.v1.patch
        12 kB
        Carlo Curino
      3. YARN-6451.v2.patch
        17 kB
        Carlo Curino
      4. YARN-6451.v3.patch
        18 kB
        Carlo Curino
      5. YARN-6451.v4.patch
        22 kB
        Carlo Curino
      6. YARN-6451.v5.patch
        22 kB
        Carlo Curino

        Issue Links

          Activity

          Hide
          curino Carlo Curino added a comment - - edited

          The patch provides and initial implementation of this idea. It does the following simple thing:

          Every time the InvariantsChecker is invoked it:

          1. poll the QueueMetrics (we could/should extend it and make it configurable)
          2. it checks a list of invariants (loaded from config file)
          3. logs any error as a warning

          The idea is to use this in few ways:

          1. For SLS-based unit/integration tests that ensure correctness of the overall RM subsystem. E.g., running for a while, and checking that important invariants are never violated (e.g., resource being non-negative, or locality going from usually good to very bad after a check-in).
          2. Performance-based analysis via SLS (and fixed environments), e.g., allocation-latency starting to get worse after a certain change.
          3. In production environments to "anticipate" customer griping.

          An extension of this is to make the "action" triggered when an invariant is violated configurable, e.g., in some cases a log is all is needed, while other times one may want an alert, or even a system.exit() if things are really bad (and/or the deployment allows it).

          Tan, Wangda, Jason Lowe, Karthik Kambatla, Subru Krishnan, Konstantinos Karanasos, Arun Suresh, Chris Douglas, Hitesh Sharma, Roni Burd, Kishore Chaliparambil: Thoughts?

          Show
          curino Carlo Curino added a comment - - edited The patch provides and initial implementation of this idea. It does the following simple thing: Every time the InvariantsChecker is invoked it: poll the QueueMetrics (we could/should extend it and make it configurable) it checks a list of invariants (loaded from config file) logs any error as a warning The idea is to use this in few ways: For SLS-based unit/integration tests that ensure correctness of the overall RM subsystem. E.g., running for a while, and checking that important invariants are never violated (e.g., resource being non-negative, or locality going from usually good to very bad after a check-in). Performance-based analysis via SLS (and fixed environments), e.g., allocation-latency starting to get worse after a certain change. In production environments to "anticipate" customer griping. An extension of this is to make the "action" triggered when an invariant is violated configurable, e.g., in some cases a log is all is needed, while other times one may want an alert, or even a system.exit() if things are really bad (and/or the deployment allows it). Tan, Wangda , Jason Lowe , Karthik Kambatla , Subru Krishnan , Konstantinos Karanasos , Arun Suresh , Chris Douglas , Hitesh Sharma , Roni Burd , Kishore Chaliparambil : Thoughts?
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 18s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 13m 57s trunk passed
          +1 compile 0m 36s trunk passed
          +1 checkstyle 0m 27s trunk passed
          +1 mvnsite 0m 38s trunk passed
          +1 mvneclipse 0m 16s trunk passed
          +1 findbugs 1m 21s trunk passed
          +1 javadoc 0m 24s trunk passed
          +1 mvninstall 0m 37s the patch passed
          +1 compile 0m 35s the patch passed
          +1 javac 0m 35s the patch passed
          -0 checkstyle 0m 26s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 13 new + 0 unchanged - 0 fixed = 13 total (was 0)
          +1 mvnsite 0m 37s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 18s the patch passed
          +1 javadoc 0m 21s the patch passed
          -1 unit 41m 53s hadoop-yarn-server-resourcemanager in the patch failed.
          -1 asflicense 0m 18s The patch generated 1 ASF License warnings.
          65m 41s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.TestResourceTrackerService
            hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerResizing
            hadoop.yarn.server.resourcemanager.TestRMRestart



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue YARN-6451
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12862390/YARN-6451.v0.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 115c09b35a7b 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / a49fac5
          Default Java 1.8.0_121
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/15553/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/15553/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/15553/testReport/
          asflicense https://builds.apache.org/job/PreCommit-YARN-Build/15553/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/15553/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 18s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 2 new or modified test files. +1 mvninstall 13m 57s trunk passed +1 compile 0m 36s trunk passed +1 checkstyle 0m 27s trunk passed +1 mvnsite 0m 38s trunk passed +1 mvneclipse 0m 16s trunk passed +1 findbugs 1m 21s trunk passed +1 javadoc 0m 24s trunk passed +1 mvninstall 0m 37s the patch passed +1 compile 0m 35s the patch passed +1 javac 0m 35s the patch passed -0 checkstyle 0m 26s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 13 new + 0 unchanged - 0 fixed = 13 total (was 0) +1 mvnsite 0m 37s the patch passed +1 mvneclipse 0m 14s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 18s the patch passed +1 javadoc 0m 21s the patch passed -1 unit 41m 53s hadoop-yarn-server-resourcemanager in the patch failed. -1 asflicense 0m 18s The patch generated 1 ASF License warnings. 65m 41s Reason Tests Failed junit tests hadoop.yarn.server.resourcemanager.TestResourceTrackerService   hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerResizing   hadoop.yarn.server.resourcemanager.TestRMRestart Subsystem Report/Notes Docker Image:yetus/hadoop:a9ad5d6 JIRA Issue YARN-6451 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12862390/YARN-6451.v0.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 115c09b35a7b 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / a49fac5 Default Java 1.8.0_121 findbugs v3.0.0 checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/15553/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt unit https://builds.apache.org/job/PreCommit-YARN-Build/15553/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/15553/testReport/ asflicense https://builds.apache.org/job/PreCommit-YARN-Build/15553/artifact/patchprocess/patch-asflicense-problems.txt modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/15553/console Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
          Hide
          curino Carlo Curino added a comment -

          Fixed checkstyles and ASF warning, improved tests, metrics coverage, and ability to throw on violation. We should choose which type of exception to throw later (currently RuntimeException)

          Show
          curino Carlo Curino added a comment - Fixed checkstyles and ASF warning, improved tests, metrics coverage, and ability to throw on violation. We should choose which type of exception to throw later (currently RuntimeException)
          Hide
          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 21s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 13m 40s trunk passed
          +1 compile 0m 32s trunk passed
          +1 checkstyle 0m 26s trunk passed
          +1 mvnsite 0m 33s trunk passed
          +1 mvneclipse 0m 14s trunk passed
          +1 findbugs 1m 2s trunk passed
          +1 javadoc 0m 20s trunk passed
          +1 mvninstall 0m 31s the patch passed
          +1 compile 0m 29s the patch passed
          +1 javac 0m 29s the patch passed
          -0 checkstyle 0m 22s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 5 new + 0 unchanged - 0 fixed = 5 total (was 0)
          +1 mvnsite 0m 31s the patch passed
          +1 mvneclipse 0m 12s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 xml 0m 2s The patch has no ill-formed XML file.
          +1 findbugs 1m 4s the patch passed
          +1 javadoc 0m 20s the patch passed
          +1 unit 39m 49s hadoop-yarn-server-resourcemanager in the patch passed.
          +1 asflicense 0m 20s The patch does not generate ASF License warnings.
          62m 8s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue YARN-6451
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12862421/YARN-6451.v1.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit xml findbugs checkstyle
          uname Linux 476cdf8efefc 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / e7167e4
          Default Java 1.8.0_121
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/15557/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/15557/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/15557/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 21s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 2 new or modified test files. +1 mvninstall 13m 40s trunk passed +1 compile 0m 32s trunk passed +1 checkstyle 0m 26s trunk passed +1 mvnsite 0m 33s trunk passed +1 mvneclipse 0m 14s trunk passed +1 findbugs 1m 2s trunk passed +1 javadoc 0m 20s trunk passed +1 mvninstall 0m 31s the patch passed +1 compile 0m 29s the patch passed +1 javac 0m 29s the patch passed -0 checkstyle 0m 22s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 5 new + 0 unchanged - 0 fixed = 5 total (was 0) +1 mvnsite 0m 31s the patch passed +1 mvneclipse 0m 12s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 xml 0m 2s The patch has no ill-formed XML file. +1 findbugs 1m 4s the patch passed +1 javadoc 0m 20s the patch passed +1 unit 39m 49s hadoop-yarn-server-resourcemanager in the patch passed. +1 asflicense 0m 20s The patch does not generate ASF License warnings. 62m 8s Subsystem Report/Notes Docker Image:yetus/hadoop:a9ad5d6 JIRA Issue YARN-6451 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12862421/YARN-6451.v1.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit xml findbugs checkstyle uname Linux 476cdf8efefc 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / e7167e4 Default Java 1.8.0_121 findbugs v3.0.0 checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/15557/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/15557/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/15557/console Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
          Hide
          jlowe Jason Lowe added a comment -

          Interesting idea. For some of these invariants, would it make more sense to put an assert-like hook in the metric code itself? I'm thinking why hope that a periodic interval happens to catch the metric being negative when we can have the metric itself protest when someone tries to set it below zero? As a bonus, we'd have access to the stacktrace that triggered it.

          I could see this periodic approach being really useful for more complicated expressions like validating stats across users, across queues, etc. where it's tricky/expensive to evaluate it on a single metric update.

          Show
          jlowe Jason Lowe added a comment - Interesting idea. For some of these invariants, would it make more sense to put an assert-like hook in the metric code itself? I'm thinking why hope that a periodic interval happens to catch the metric being negative when we can have the metric itself protest when someone tries to set it below zero? As a bonus, we'd have access to the stacktrace that triggered it. I could see this periodic approach being really useful for more complicated expressions like validating stats across users, across queues, etc. where it's tricky/expensive to evaluate it on a single metric update.
          Hide
          curino Carlo Curino added a comment -

          Jason Lowe I think the inline asserts might be useful particularly for simple "range" type checks (like the silly examples I have so far).
          The one concern would be whether the checks done for each state transition of the metrics become to expensive. The other issue
          is that while certain invariants are universally true, other might be deployment specific, and by having them externally loaded/configured
          like I do in this example, it is easier to customize them for a specific workload. E.g., in some of our clusters apps are self-throttling
          and when they ask for containers they should be receiving them very quickly, so we would like to establish some invariant on allocation
          latency, which cannot be assert generally.

          All in all, I would like to foster more invariant checking in our codebase, as a way to complement more specific unit tests---this
          little patch is a step in that direction.
          In particular, given the work done in SLS, I think we can easily have integration tests that run large portions of the codebase (e.g., the RM)
          simulating a large workload, and check that important invariants (e.g., complex one like you mentioned) are respected.

          Show
          curino Carlo Curino added a comment - Jason Lowe I think the inline asserts might be useful particularly for simple "range" type checks (like the silly examples I have so far). The one concern would be whether the checks done for each state transition of the metrics become to expensive. The other issue is that while certain invariants are universally true, other might be deployment specific, and by having them externally loaded/configured like I do in this example, it is easier to customize them for a specific workload. E.g., in some of our clusters apps are self-throttling and when they ask for containers they should be receiving them very quickly, so we would like to establish some invariant on allocation latency, which cannot be assert generally. All in all, I would like to foster more invariant checking in our codebase, as a way to complement more specific unit tests---this little patch is a step in that direction. In particular, given the work done in SLS, I think we can easily have integration tests that run large portions of the codebase (e.g., the RM) simulating a large workload, and check that important invariants (e.g., complex one like you mentioned) are respected.
          Hide
          curino Carlo Curino added a comment -

          Some more cleanup.

          Show
          curino Carlo Curino added a comment - Some more cleanup.
          Hide
          chris.douglas Chris Douglas added a comment -

          Cool, I hadn't seen the javax.script package before. Throwing a bespoke exception can also be configured to halt the JVM and call back to a debugger, which is a nice touch for the SLS case.

          • The invariants can be precompiled, to avoid the parsing/compilation overhead for each iteration.
          • If not invoking a debugger, then it'd be nice to know the bindings when the invariant doesn't hold.
          • The invariant check could be part of the metrics2.MetricsCollector, particularly if it's possible to filter the metrics it gathers based on the configured invariants.
          Show
          chris.douglas Chris Douglas added a comment - Cool, I hadn't seen the javax.script package before. Throwing a bespoke exception can also be configured to halt the JVM and call back to a debugger, which is a nice touch for the SLS case. The invariants can be precompiled, to avoid the parsing/compilation overhead for each iteration. If not invoking a debugger, then it'd be nice to know the bindings when the invariant doesn't hold. The invariant check could be part of the metrics2.MetricsCollector , particularly if it's possible to filter the metrics it gathers based on the configured invariants.
          Hide
          curino Carlo Curino added a comment -

          Thanks Chris Douglas for the feedback.

          1. I did the precompilation as you suggested (I didn't know the Javascrip engine is a Compilable subclass of the general ScriptEngine one), it helps somewhat. Poking at performance, I also found that the longer I ran it the slower it got... it was due to the collector accumulating records. I know clear it at each iteration. Combined this brought us down to about 1ms per iteration if we keep all invariant separate (one per line of our script file), and 0.07ms per invocation if we combine them in a single large invariant (with all individual invariants in && ).
            Pros and cons, when invariants are violated the log line is harder to read if combined, but perf is much better. In the current example of invariants.txt I will leave this with one invariant per line, so slower but easier to understand---works?
          1. I added this to the logging/exception message. In particular, I am pruning the bindings, so that the message should contain only the bindings used in the failing invariant (bar performance tricks above, this makes for a very readable output).
          1. As we discussed offline, while it is true we could push the checking deep into the collector and get a little closer to detect the issues to when they happen, since we run say every second with this, it is unlikely we will improve detection much (we shave sub-millis time, but we might still be 0.5sec off in average from when the violation occurred). Short of checking at every metrics update (very costly), we probably can only detect issues a little after they have happened. This seems anyway much better than days later when a customer complains
          Show
          curino Carlo Curino added a comment - Thanks Chris Douglas for the feedback. I did the precompilation as you suggested (I didn't know the Javascrip engine is a Compilable subclass of the general ScriptEngine one), it helps somewhat. Poking at performance, I also found that the longer I ran it the slower it got... it was due to the collector accumulating records. I know clear it at each iteration. Combined this brought us down to about 1ms per iteration if we keep all invariant separate (one per line of our script file), and 0.07ms per invocation if we combine them in a single large invariant (with all individual invariants in && ). Pros and cons, when invariants are violated the log line is harder to read if combined, but perf is much better. In the current example of invariants.txt I will leave this with one invariant per line, so slower but easier to understand---works? I added this to the logging/exception message. In particular, I am pruning the bindings, so that the message should contain only the bindings used in the failing invariant (bar performance tricks above, this makes for a very readable output). As we discussed offline, while it is true we could push the checking deep into the collector and get a little closer to detect the issues to when they happen, since we run say every second with this, it is unlikely we will improve detection much (we shave sub-millis time, but we might still be 0.5sec off in average from when the violation occurred). Short of checking at every metrics update (very costly), we probably can only detect issues a little after they have happened. This seems anyway much better than days later when a customer complains
          Hide
          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 17s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 13m 20s trunk passed
          +1 compile 0m 33s trunk passed
          +1 checkstyle 0m 25s trunk passed
          +1 mvnsite 0m 36s trunk passed
          +1 mvneclipse 0m 18s trunk passed
          +1 findbugs 1m 0s trunk passed
          +1 javadoc 0m 21s trunk passed
          +1 mvninstall 0m 32s the patch passed
          +1 compile 0m 30s the patch passed
          +1 javac 0m 30s the patch passed
          -0 checkstyle 0m 22s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 9 new + 0 unchanged - 0 fixed = 9 total (was 0)
          +1 mvnsite 0m 32s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 xml 0m 1s The patch has no ill-formed XML file.
          +1 findbugs 1m 6s the patch passed
          +1 javadoc 0m 18s the patch passed
          +1 unit 38m 39s hadoop-yarn-server-resourcemanager in the patch passed.
          +1 asflicense 0m 17s The patch does not generate ASF License warnings.
          60m 42s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:0ac17dc
          JIRA Issue YARN-6451
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12863676/YARN-6451.v3.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit xml findbugs checkstyle
          uname Linux 291e01583ccb 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / f1de2c8
          Default Java 1.8.0_121
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/15653/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/15653/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/15653/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 17s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 2 new or modified test files. +1 mvninstall 13m 20s trunk passed +1 compile 0m 33s trunk passed +1 checkstyle 0m 25s trunk passed +1 mvnsite 0m 36s trunk passed +1 mvneclipse 0m 18s trunk passed +1 findbugs 1m 0s trunk passed +1 javadoc 0m 21s trunk passed +1 mvninstall 0m 32s the patch passed +1 compile 0m 30s the patch passed +1 javac 0m 30s the patch passed -0 checkstyle 0m 22s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 9 new + 0 unchanged - 0 fixed = 9 total (was 0) +1 mvnsite 0m 32s the patch passed +1 mvneclipse 0m 14s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 xml 0m 1s The patch has no ill-formed XML file. +1 findbugs 1m 6s the patch passed +1 javadoc 0m 18s the patch passed +1 unit 38m 39s hadoop-yarn-server-resourcemanager in the patch passed. +1 asflicense 0m 17s The patch does not generate ASF License warnings. 60m 42s Subsystem Report/Notes Docker Image:yetus/hadoop:0ac17dc JIRA Issue YARN-6451 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12863676/YARN-6451.v3.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit xml findbugs checkstyle uname Linux 291e01583ccb 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / f1de2c8 Default Java 1.8.0_121 findbugs v3.0.0 checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/15653/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/15653/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/15653/console Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
          Hide
          chris.douglas Chris Douglas added a comment -

          when invariants are violated the log line is harder to read if combined, but perf is much better. In the current example of invariants.txt I will leave this with one invariant per line, so slower but easier to understand---works?

          This could evaluate the combined expression, and only if it detects some violation, iterate over the set of expressions to print specific error messages. Though shaving fractions of a millisecond off the validation check is probably not significant.

          +1 overall. For future versions:

          • The invariant checker might want to use bindings across contexts; this would be hard to express as subtypes of InvariantsChecker. For example, if one wanted to check some invariant using values from the scheduler and the metrics, there isn't a good way to compose the two with inheritance. That said, in the current RM it's hard to correlate values collected from multiple components without reasoning about their mutual consistency in a brittle, ad hoc way. How invariants are loaded and how errors are handled could also be abstracted, but (IMHO) that'd be premature. This is approachable as-is.
          • The unit test is kind of light
          • This could print a warning when it starts up, since it's mostly for testing. If it's accidentally deployed in a production setting, it should show up in the log. The RM refuses to start if invariants.txt is missing?
          Show
          chris.douglas Chris Douglas added a comment - when invariants are violated the log line is harder to read if combined, but perf is much better. In the current example of invariants.txt I will leave this with one invariant per line, so slower but easier to understand---works? This could evaluate the combined expression, and only if it detects some violation, iterate over the set of expressions to print specific error messages. Though shaving fractions of a millisecond off the validation check is probably not significant. +1 overall. For future versions: The invariant checker might want to use bindings across contexts; this would be hard to express as subtypes of InvariantsChecker . For example, if one wanted to check some invariant using values from the scheduler and the metrics, there isn't a good way to compose the two with inheritance. That said, in the current RM it's hard to correlate values collected from multiple components without reasoning about their mutual consistency in a brittle, ad hoc way. How invariants are loaded and how errors are handled could also be abstracted, but (IMHO) that'd be premature. This is approachable as-is. The unit test is kind of light This could print a warning when it starts up, since it's mostly for testing. If it's accidentally deployed in a production setting, it should show up in the log. The RM refuses to start if invariants.txt is missing?
          Hide
          curino Carlo Curino added a comment -

          Good call. I implemented the fast check all invariant at once, and re-check them individually to give better logs. Seems to work well (<0.1ms runtime for the normal case and pretty logs of the form:

          Invariant "AvailableVCores >= 0" is NOT holding, with bindings: {AvailableVCores=-1}
          
          1. Agreed on the general next steps for generalizing this. Per our offline discussion, it is probably worth to refactor and extend this later, if/when we start using it more heavily.
          2. I have increased (slightly) the test coverage, though the bulk of test usage will come in follow-up patches where we combine SLS and this mechanics to have basically integration tests for the overall RM.
          3. I added the warning printout during init.
          Show
          curino Carlo Curino added a comment - Good call. I implemented the fast check all invariant at once, and re-check them individually to give better logs. Seems to work well (<0.1ms runtime for the normal case and pretty logs of the form: Invariant "AvailableVCores >= 0" is NOT holding, with bindings: {AvailableVCores=-1} Agreed on the general next steps for generalizing this. Per our offline discussion, it is probably worth to refactor and extend this later, if/when we start using it more heavily. I have increased (slightly) the test coverage, though the bulk of test usage will come in follow-up patches where we combine SLS and this mechanics to have basically integration tests for the overall RM. I added the warning printout during init.
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 28s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 13m 34s trunk passed
          +1 compile 0m 33s trunk passed
          +1 checkstyle 0m 25s trunk passed
          +1 mvnsite 0m 36s trunk passed
          +1 mvneclipse 0m 18s trunk passed
          +1 findbugs 1m 1s trunk passed
          +1 javadoc 0m 20s trunk passed
          +1 mvninstall 0m 32s the patch passed
          +1 compile 0m 31s the patch passed
          +1 javac 0m 31s the patch passed
          +1 checkstyle 0m 23s the patch passed
          +1 mvnsite 0m 34s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 xml 0m 1s The patch has no ill-formed XML file.
          -1 findbugs 1m 7s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0)
          +1 javadoc 0m 18s the patch passed
          -1 unit 39m 42s hadoop-yarn-server-resourcemanager in the patch failed.
          +1 asflicense 0m 17s The patch does not generate ASF License warnings.
          62m 15s



          Reason Tests
          FindBugs module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
            org.apache.hadoop.yarn.server.resourcemanager.monitor.invariants.MetricsInvariantChecker.init(Configuration, RMContext, PreemptableResourceScheduler) concatenates strings using + in a loop At MetricsInvariantChecker.java:using + in a loop At MetricsInvariantChecker.java:[line 120]
          Failed junit tests hadoop.yarn.server.resourcemanager.TestRMRestart



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:0ac17dc
          JIRA Issue YARN-6451
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12863700/YARN-6451.v4.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit xml findbugs checkstyle
          uname Linux 42d85b8eb00c 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 6ed9d36
          Default Java 1.8.0_121
          findbugs v3.0.0
          findbugs https://builds.apache.org/job/PreCommit-YARN-Build/15656/artifact/patchprocess/new-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html
          unit https://builds.apache.org/job/PreCommit-YARN-Build/15656/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/15656/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/15656/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 28s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 2 new or modified test files. +1 mvninstall 13m 34s trunk passed +1 compile 0m 33s trunk passed +1 checkstyle 0m 25s trunk passed +1 mvnsite 0m 36s trunk passed +1 mvneclipse 0m 18s trunk passed +1 findbugs 1m 1s trunk passed +1 javadoc 0m 20s trunk passed +1 mvninstall 0m 32s the patch passed +1 compile 0m 31s the patch passed +1 javac 0m 31s the patch passed +1 checkstyle 0m 23s the patch passed +1 mvnsite 0m 34s the patch passed +1 mvneclipse 0m 14s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 xml 0m 1s The patch has no ill-formed XML file. -1 findbugs 1m 7s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) +1 javadoc 0m 18s the patch passed -1 unit 39m 42s hadoop-yarn-server-resourcemanager in the patch failed. +1 asflicense 0m 17s The patch does not generate ASF License warnings. 62m 15s Reason Tests FindBugs module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager   org.apache.hadoop.yarn.server.resourcemanager.monitor.invariants.MetricsInvariantChecker.init(Configuration, RMContext, PreemptableResourceScheduler) concatenates strings using + in a loop At MetricsInvariantChecker.java:using + in a loop At MetricsInvariantChecker.java: [line 120] Failed junit tests hadoop.yarn.server.resourcemanager.TestRMRestart Subsystem Report/Notes Docker Image:yetus/hadoop:0ac17dc JIRA Issue YARN-6451 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12863700/YARN-6451.v4.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit xml findbugs checkstyle uname Linux 42d85b8eb00c 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 6ed9d36 Default Java 1.8.0_121 findbugs v3.0.0 findbugs https://builds.apache.org/job/PreCommit-YARN-Build/15656/artifact/patchprocess/new-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html unit https://builds.apache.org/job/PreCommit-YARN-Build/15656/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/15656/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/15656/console Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
          Hide
          chris.douglas Chris Douglas added a comment -

          +1 I committed this. Thanks Carlo

          Show
          chris.douglas Chris Douglas added a comment - +1 I committed this. Thanks Carlo
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11601 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11601/)
          YARN-6451. Add RM monitor validating metrics invariants. Contributed by (cdouglas: rev af8e9842d2ca566528e09d905b609f1cf160d367)

          • (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/invariants/TestMetricsInvariantChecker.java
          • (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/invariants/MetricsInvariantChecker.java
          • (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/invariants/InvariantsChecker.java
          • (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/invariants/package-info.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml
          • (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/invariants.txt
          • (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/invariants/InvariantViolationException.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11601 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11601/ ) YARN-6451 . Add RM monitor validating metrics invariants. Contributed by (cdouglas: rev af8e9842d2ca566528e09d905b609f1cf160d367) (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/invariants/TestMetricsInvariantChecker.java (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/invariants/MetricsInvariantChecker.java (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/invariants/InvariantsChecker.java (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/invariants/package-info.java (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/invariants.txt (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/invariants/InvariantViolationException.java
          Hide
          curino Carlo Curino added a comment -

          Thanks Chris Douglas, I might cherry-pick it back to branch-2 later on.

          Show
          curino Carlo Curino added a comment - Thanks Chris Douglas , I might cherry-pick it back to branch-2 later on.
          Hide
          leftnoteasy Wangda Tan added a comment -

          Thanks Carlo Curino/Chris Douglas,

          Beyond metrics, i think there're many information are not inside metrics, such as order of container allocation to ensure FIFO/fairness, etc. Have you thought about how to formalize these requirements?

          Show
          leftnoteasy Wangda Tan added a comment - Thanks Carlo Curino / Chris Douglas , Beyond metrics, i think there're many information are not inside metrics, such as order of container allocation to ensure FIFO/fairness, etc. Have you thought about how to formalize these requirements?
          Hide
          curino Carlo Curino added a comment - - edited

          I see two or three alternatives:

          1. Hard-coding the most important invariants in a programmatic way, you see an example of this in: YARN-6473, where I poke the ReservationSystem and YarnScheduler to check whether their data-structures remain in sync during execution. This is more minimalistic/efficient, but any extension requires code changes. For example, you can maintain an observer of container allocations, and check that certain ordering properties are respected.
          2. Expand the mechanics of YARN-6451 by adding "bindings" for many more parts of the RM internal state, which one is allowed to mentioned in the invariants.txt file. Metrics was a natural starting point, as the cost of gathering is already there, and their names are externally known. To minimize the cost, we could load the invariants.txt expressions, and then limit the "state" we probe to be the least one covering the needs of our expressions.
          3. (discussing with Chris Douglas another option emerged) Leverage compiler APIs / aspects / dependency-injection type of tricks to dynamically modify the code that does the binding work, to cover whatever appears in invariants.txt file. This is obviously the richest one, though it has some maintainability issues.

          In YARN-6547 I propose a simple way of combining YARN-6363 and YARN-6451 capabilities to run tests that check an SLS run for common invariants (both during and at the end of the run). That is mostly a mechanism patch, but we can work together to define very tight yet robust invariants for specific runs.

          Show
          curino Carlo Curino added a comment - - edited I see two or three alternatives: Hard-coding the most important invariants in a programmatic way, you see an example of this in: YARN-6473 , where I poke the ReservationSystem and YarnScheduler to check whether their data-structures remain in sync during execution. This is more minimalistic/efficient, but any extension requires code changes. For example, you can maintain an observer of container allocations, and check that certain ordering properties are respected. Expand the mechanics of YARN-6451 by adding "bindings" for many more parts of the RM internal state, which one is allowed to mentioned in the invariants.txt file. Metrics was a natural starting point, as the cost of gathering is already there, and their names are externally known. To minimize the cost, we could load the invariants.txt expressions, and then limit the "state" we probe to be the least one covering the needs of our expressions. (discussing with Chris Douglas another option emerged) Leverage compiler APIs / aspects / dependency-injection type of tricks to dynamically modify the code that does the binding work, to cover whatever appears in invariants.txt file. This is obviously the richest one, though it has some maintainability issues. In YARN-6547 I propose a simple way of combining YARN-6363 and YARN-6451 capabilities to run tests that check an SLS run for common invariants (both during and at the end of the run). That is mostly a mechanism patch, but we can work together to define very tight yet robust invariants for specific runs.
          Hide
          leftnoteasy Wangda Tan added a comment -

          Thanks Carlo Curino for your responses.

          I personally think #3 is the good way to go, I agree the approach to get low-hanging fruit first via existing metrics-based mechanisms.

          Show
          leftnoteasy Wangda Tan added a comment - Thanks Carlo Curino for your responses. I personally think #3 is the good way to go, I agree the approach to get low-hanging fruit first via existing metrics-based mechanisms.

            People

            • Assignee:
              curino Carlo Curino
              Reporter:
              curino Carlo Curino
            • Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development