  1. Hadoop YARN
  2. YARN-6640

AM heartbeat stuck when responseId overflows MAX_INT

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0-beta1, 2.8.2
    • Component/s: None
    • Labels:
      None

      Description

      The current code in ApplicationMasterService:

      if ((request.getResponseId() + 1) == lastResponse.getResponseId()) {
        /* old heartbeat */
        return lastResponse;
      } else if (request.getResponseId() + 1 < lastResponse.getResponseId()) {
        throw ...
      }
      // process the heartbeat...

      When a heartbeat comes in, in the usual case we expect request.getResponseId() == lastResponse.getResponseId(). The “if” handles a duplicate heartbeat that is one step old; the “else if” throws for heartbeats that are two or more steps old. Otherwise we accept the new heartbeat and process it.

      So the bug is: when lastResponse.getResponseId() == MAX_INT, the newest heartbeat comes in with responseId == MAX_INT. However, responseId + 1 overflows to MIN_INT, so we fall into the “else if” case and the RM throws. The AM never receives a newer response, so its retries fail the same way and we are stuck here.
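The overflow can be reproduced outside the RM. The following is a minimal sketch, not the ApplicationMasterService code: `OldResponseIdCheck`, `Outcome`, and `classify` are hypothetical names, and plain ints stand in for request.getResponseId() and lastResponse.getResponseId().

```java
// Hypothetical stand-in for the pre-fix check; not the real RM class.
final class OldResponseIdCheck {
  enum Outcome { DUPLICATE, REJECTED, PROCESSED }

  // Mirrors the quoted if / else-if chain using plain ints.
  static Outcome classify(int requestId, int lastResponseId) {
    if (requestId + 1 == lastResponseId) {
      return Outcome.DUPLICATE;   // old heartbeat: resend lastResponse
    } else if (requestId + 1 < lastResponseId) {
      return Outcome.REJECTED;    // the RM throws here
    }
    return Outcome.PROCESSED;     // normal heartbeat
  }

  public static void main(String[] args) {
    // Java int arithmetic wraps silently: MAX_VALUE + 1 is MIN_VALUE.
    System.out.println(Integer.MAX_VALUE + 1 == Integer.MIN_VALUE);
    // An in-sync heartbeat at an ordinary ID is processed.
    System.out.println(classify(41, 41));
    // The same in-sync heartbeat at MAX_VALUE lands in the else-if branch.
    System.out.println(classify(Integer.MAX_VALUE, Integer.MAX_VALUE));
  }
}
```

In this sketch an in-sync heartbeat is processed at ID 41 but rejected at Integer.MAX_VALUE, which matches the failure described above.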

      1. YARN-6640.v1.patch
        10 kB
        Botong Huang
      2. YARN-6640.v2.patch
        9 kB
        Botong Huang

        Issue Links

          Activity

          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 12s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 15m 19s trunk passed
          +1 compile 0m 37s trunk passed
          +1 checkstyle 0m 27s trunk passed
          +1 mvnsite 0m 37s trunk passed
          +1 mvneclipse 0m 17s trunk passed
          +1 findbugs 1m 5s trunk passed
          +1 javadoc 0m 24s trunk passed
          +1 mvninstall 0m 36s the patch passed
          +1 compile 0m 33s the patch passed
          +1 javac 0m 33s the patch passed
          +1 checkstyle 0m 24s the patch passed
          +1 mvnsite 0m 33s the patch passed
          +1 mvneclipse 0m 15s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 3s the patch passed
          +1 javadoc 0m 20s the patch passed
          +1 unit 39m 31s hadoop-yarn-server-resourcemanager in the patch passed.
          +1 asflicense 0m 20s The patch does not generate ASF License warnings.
          63m 58s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:14b5c93
          JIRA Issue YARN-6640
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12871045/YARN-6640.v1.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux a269ced972db 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 73ecb19
          Default Java 1.8.0_131
          findbugs v3.1.0-RC1
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/16093/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/16093/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          botong Botong Huang added a comment -

          Tan, Wangda (Jian He and Arun Suresh), can you please take a look at this patch? Thanks in advance!

          leftnoteasy Wangda Tan added a comment -

          Thanks Botong Huang for reporting and working on this case. I think this is a pretty severe issue which we should fix and backport to previous releases. Marked as blocker for all > 2.8 releases.

          We have two choices. One is to change the type of the response ID from int to long. I personally don't prefer that because even a long could be exhausted if an app runs for a hundred years, and it has a compatibility issue as well.

          I prefer to reuse int like what you did in the patch: if the value equals MAX_INT, we set it back to 0 and handle the special checking logic. I don't suggest having a special reserved ID; this makes the code confusing.

          Another potential problem is that we only fail when request.responseId < lastResponse.responseId - 1. I think we should also fail when request.responseId > lastResponse.responseId.

          Thoughts?

          + Jason Lowe.

          jlowe Jason Lowe added a comment -

          I agree we don't need to change from int to long here; we can just manage the wrap. It would be simpler if we didn't have to manage the special case of the response being -1 to indicate we've never heard from this AM. Otherwise we would only need to handle two valid cases: current == last or current + 1 == last. Anything else would be an error.

          Given we need to reserve a value for "not registered" we can use Wangda's idea to explicitly manage the looping at some convenient positive value rather than allowing the response ID to ever go negative after it has registered. Then we leave all the negative values for other "special values" if we ever need them. It's not like we need all these possible values to know when we're out of sync given we only expect this value or the previous.

          For example, I think we could do something like the following (note: haven't tested, straight off the top of my head):

                if (((request.getResponseId() + 1) & Integer.MAX_VALUE) == lastResponse.getResponseId()) {
                  /* old heartbeat */
                  return lastResponse;
                } else if (request.getResponseId() != lastResponse.getResponseId()) {
                  String message =
                      "Invalid responseId in AllocateRequest from application attempt: "
                          + appAttemptId + ", expect responseId to be "
                          + (lastResponse.getResponseId() + 1);
                  throw new InvalidApplicationMasterRequestException(message);
                }
          [...]
                    response.setResponseId((lastResponse.getResponseId() + 1) & Integer.MAX_VALUE);
          

          There should be a little helper function that takes the current response ID and returns the next one to make it easier to read.
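One possible shape for such a helper is sketched below. The class name `ResponseIds` and method name `next` are hypothetical and not taken from the patch; the masking expression is the one from the snippet above.

```java
// Hypothetical helper; the name and placement are assumptions.
final class ResponseIds {
  private ResponseIds() {}

  // Returns the ID after `current`. Masking with MAX_VALUE clears the sign
  // bit, so MAX_VALUE wraps to 0 instead of overflowing to MIN_VALUE.
  static int next(int current) {
    return (current + 1) & Integer.MAX_VALUE;
  }

  public static void main(String[] args) {
    System.out.println(next(0));                  // ordinary increment
    System.out.println(next(Integer.MAX_VALUE));  // wraps to 0
    System.out.println(next(-1));                 // -1 ("never heard from this AM") advances to 0
  }
}
```

With the wrap in one place, both the duplicate check and the setResponseId call can use the same function, and the negative range stays free for special values.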

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 21s Docker mode activated.
                Prechecks
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
                trunk Compile Tests
          +1 mvninstall 13m 20s trunk passed
          +1 compile 0m 33s trunk passed
          +1 checkstyle 0m 26s trunk passed
          +1 mvnsite 0m 35s trunk passed
          +1 findbugs 0m 56s trunk passed
          +1 javadoc 0m 22s trunk passed
                Patch Compile Tests
          +1 mvninstall 0m 31s the patch passed
          +1 compile 0m 31s the patch passed
          +1 javac 0m 31s the patch passed
          +1 checkstyle 0m 23s the patch passed
          +1 mvnsite 0m 32s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 3s the patch passed
          +1 javadoc 0m 19s the patch passed
                Other Tests
          -1 unit 45m 47s hadoop-yarn-server-resourcemanager in the patch failed.
          +1 asflicense 0m 13s The patch does not generate ASF License warnings.
          67m 15s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:14b5c93
          JIRA Issue YARN-6640
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12871045/YARN-6640.v1.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux ffcba91d5067 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / c379310
          Default Java 1.8.0_144
          findbugs v3.1.0-RC1
          unit https://builds.apache.org/job/PreCommit-YARN-Build/17077/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/17077/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/17077/console
          Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          leftnoteasy Wangda Tan added a comment -

          Agree with the approach Jason Lowe provided.

          Jason, could you also share your thoughts on the question below?

          Another potential problem is that we only fail when request.responseId < lastResponse.responseId - 1. I think we should also fail when request.responseId > lastResponse.responseId.

          botong Botong Huang added a comment -

          Hi Tan, Wangda, I think Jason Lowe's approach already handles the case request.responseId > lastResponseId. I will update the patch soon. Thanks for the comments!

          jlowe Jason Lowe added a comment -

          Yes, sorry I didn't call it out explicitly. I agree that we should only expect a request to have the same ID we sent in the last response or the previous ID. Anything else should be an error since the AM is out of sync with the RM. A sane AM could send a request ID that is far larger than the RM's current ID after the RM restarts, but I think that case should already be covered by the !hasApplicationMasterRegistered check before we compare the request ID to the last response ID.
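The rule described here (accept the last ID, treat the previous ID as a duplicate, reject everything else) can be sketched as follows. `HeartbeatSync`, `Outcome`, and `classify` are hypothetical names, the wrap at Integer.MAX_VALUE follows the earlier suggestion, and the registration check mentioned above is assumed to run before this code and is not modelled.

```java
// Hypothetical sketch of the accept / duplicate / reject rule.
final class HeartbeatSync {
  enum Outcome { CURRENT, DUPLICATE, OUT_OF_SYNC }

  // Next ID with wrap-around from MAX_VALUE to 0.
  static int next(int id) {
    return (id + 1) & Integer.MAX_VALUE;
  }

  static Outcome classify(int requestId, int lastResponseId) {
    if (requestId == lastResponseId) {
      return Outcome.CURRENT;      // AM saw the last response: process
    }
    if (next(requestId) == lastResponseId) {
      return Outcome.DUPLICATE;    // AM is one step behind: resend
    }
    return Outcome.OUT_OF_SYNC;    // older or larger ID: error
  }

  public static void main(String[] args) {
    System.out.println(classify(5, 5));
    System.out.println(classify(4, 5));
    System.out.println(classify(Integer.MAX_VALUE, 0)); // duplicate across the wrap
    System.out.println(classify(9, 5));                 // larger than last
  }
}
```

Because only two values are ever valid, a request ID larger than the last response ID falls through to the error case without a separate comparison.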

          leftnoteasy Wangda Tan added a comment -

          Jason Lowe, great, I think we're on the same page for this! Botong Huang, could you address the sanity check for request.responseId > lastResponse.responseId in this patch as well?

          leftnoteasy Wangda Tan added a comment -

          Botong Huang, I might have misread your comment. I will review the updated patch and let you know.

          botong Botong Huang added a comment -

          Sure, v2 patch uploaded. Thanks!

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 17s Docker mode activated.
                Prechecks
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
                trunk Compile Tests
          +1 mvninstall 16m 26s trunk passed
          +1 compile 0m 43s trunk passed
          +1 checkstyle 0m 28s trunk passed
          +1 mvnsite 0m 42s trunk passed
          +1 findbugs 1m 10s trunk passed
          +1 javadoc 0m 22s trunk passed
                Patch Compile Tests
          +1 mvninstall 0m 35s the patch passed
          +1 compile 0m 35s the patch passed
          +1 javac 0m 35s the patch passed
          +1 checkstyle 0m 25s the patch passed
          +1 mvnsite 0m 38s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 14s the patch passed
          +1 javadoc 0m 20s the patch passed
                Other Tests
          -1 unit 44m 34s hadoop-yarn-server-resourcemanager in the patch failed.
          +1 asflicense 0m 16s The patch does not generate ASF License warnings.
          70m 5s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:14b5c93
          JIRA Issue YARN-6640
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12883359/YARN-6640.v2.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 3e6e65f7572a 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 4249172
          Default Java 1.8.0_144
          findbugs v3.1.0-RC1
          unit https://builds.apache.org/job/PreCommit-YARN-Build/17090/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/17090/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/17090/console
          Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          botong Botong Huang added a comment -

          The unit test failure is unrelated to this patch and is being tracked under YARN-7044.

          leftnoteasy Wangda Tan added a comment -

          +1, thanks Botong Huang.

          jlowe Jason Lowe added a comment -

          +1 lgtm as well. Committing this.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #12240 (See https://builds.apache.org/job/Hadoop-trunk-Commit/12240/)
          YARN-6640. AM heartbeat stuck when responseId overflows MAX_INT. (jlowe: rev 3a4e861169dc3da9df0158ba6f44a9bc8576e217)

          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockAM.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java
          jlowe Jason Lowe added a comment -

          Thanks to Botong Huang for the contribution and to Wangda Tan for additional review! I committed this to trunk, branch-2, branch-2.8, and branch-2.8.2.

          botong Botong Huang added a comment -

          Great, thanks Jason Lowe, Wangda Tan for the review and advice!

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Botong Huang / Jason Lowe / Wangda Tan, we do use these responseIds between RM -> NM, MapReduce AM -> Task also. Haven't gone through the details of this JIRA, but we should see if the bug here is applicable to those cases too?

          botong Botong Huang added a comment -

          Hi Vinod Kumar Vavilapalli, yes, the same issue exists in the NM->RM heartbeat as well. This is actually on my todo list. I just opened YARN-7102 for it and will work on it soon. Thanks!


            People

            • Assignee: Botong Huang
            • Reporter: Botong Huang
            • Votes: 0
            • Watchers: 10
