Hadoop YARN
  1. Hadoop YARN
  2. YARN-744

Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated.

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      Looks like the lock taken in this is broken. It takes a lock on lastResponse object and then puts a new lastResponse object into the map. At this point a new thread entering this function will get a new lastResponse object and will be able to take its lock and enter the critical section. Presumably we want to limit one response per app attempt. So the lock could be taken on the ApplicationAttemptId key of the response map object.

      1. YARN-744.2.patch
        13 kB
        Omkar Vinit Joshi
      2. YARN-744.1.patch
        12 kB
        Omkar Vinit Joshi
      3. YARN-744-20130726.1.patch
        14 kB
        Omkar Vinit Joshi
      4. YARN-744-20130715.1.patch
        21 kB
        Omkar Vinit Joshi
      5. YARN-744-20130711.1.patch
        15 kB
        Omkar Vinit Joshi
      6. YARN-744.patch
        3 kB
        Omkar Vinit Joshi
      7. MAPREDUCE-3899-branch-0.23.patch
        9 kB
        Bikas Saha

        Issue Links

          Activity

          Hide
          Bikas Saha added a comment -

          1) removed the lock as its not needed
          2) added comments for when it might be needed
          3) refactored appAttemptId -> applicationAttemptId to make it consistent across the file
          4) added a couple of logs to make it consistent across the processing
          5) a few Eclipse code style changes

          Show
          Bikas Saha added a comment - 1) removed the lock as its not needed 2) added comments for when it might be needed 3) refactored appAttemptId -> applicationAttemptId to make it consistent across the file 4) added a couple of logs to make it consistent across the processing 5) a few Eclipse code style changes
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12515864/MAPREDUCE-3899-branch-0.23.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1925//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1925//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12515864/MAPREDUCE-3899-branch-0.23.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1925//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1925//console This message is automatically generated.
          Hide
          Siddharth Seth added a comment -

          Bikas, I'm not sure that the locking isn't needed. It's broken the way it is rightnow though.
          ApplicationMasterService seems to be written to handle one bad request (caused by network issues etc). allocate() does more than a scheduler.allocate - node state changes, newly allocated containers etc. All of that needs to be part of a single response. Without the lock - that may not happen. Locking on the AppAttemptId or the AppId itself is probably a better option, and also needs to cover the retrieval of the last response.
          Also, we shouldn't be logging in the allocate call (not at INFO level anyway). That'll flood the RM logs.

          Show
          Siddharth Seth added a comment - Bikas, I'm not sure that the locking isn't needed. It's broken the way it is rightnow though. ApplicationMasterService seems to be written to handle one bad request (caused by network issues etc). allocate() does more than a scheduler.allocate - node state changes, newly allocated containers etc. All of that needs to be part of a single response. Without the lock - that may not happen. Locking on the AppAttemptId or the AppId itself is probably a better option, and also needs to cover the retrieval of the last response. Also, we shouldn't be logging in the allocate call (not at INFO level anyway). That'll flood the RM logs.
          Hide
          Bikas Saha added a comment -

          Yeah. Those complications have left this jira dead for sometime. I should have cancelled the patch long ago. Sorry about that. If you have time could you look at MAPREDUCE-3921 instead

          Show
          Bikas Saha added a comment - Yeah. Those complications have left this jira dead for sometime. I should have cancelled the patch long ago. Sorry about that. If you have time could you look at MAPREDUCE-3921 instead
          Hide
          Omkar Vinit Joshi added a comment -

          The problem here is that we retrieve the last response from resource map and then try to grab a lock on it. However after grabbing lock we don't check if the last response in resource map itself got updated or not. That results into a race condition which I am trying to solve here.. After grabbing the lock an additional check has to be made to ensure that lastResponse was not changed in between i.e. no other AM requests were processed.

          Show
          Omkar Vinit Joshi added a comment - The problem here is that we retrieve the last response from resource map and then try to grab a lock on it. However after grabbing lock we don't check if the last response in resource map itself got updated or not. That results into a race condition which I am trying to solve here.. After grabbing the lock an additional check has to be made to ensure that lastResponse was not changed in between i.e. no other AM requests were processed.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12590077/YARN-744.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1404//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1404//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590077/YARN-744.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1404//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1404//console This message is automatically generated.
          Hide
          Bikas Saha added a comment -

          The same problem exists in other protocols methods in ApplicationMasterService. Would retrieving the app from the rmcontext and locking on the app be any better?

          Show
          Bikas Saha added a comment - The same problem exists in other protocols methods in ApplicationMasterService. Would retrieving the app from the rmcontext and locking on the app be any better?
          Hide
          Omkar Vinit Joshi added a comment -

          Bikas Saha sounds reasonable ..will take a look at it again.

          Show
          Omkar Vinit Joshi added a comment - Bikas Saha sounds reasonable ..will take a look at it again.
          Hide
          Omkar Vinit Joshi added a comment -

          Adding wrapper around AllocateResponse so that I don't have to maintain new map for appAttemptId. Synchronizing all accesses on that.

          Show
          Omkar Vinit Joshi added a comment - Adding wrapper around AllocateResponse so that I don't have to maintain new map for appAttemptId. Synchronizing all accesses on that.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12591936/YARN-744-20130711.1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1465//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/1465//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1465//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12591936/YARN-744-20130711.1.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. -1 findbugs . The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1465//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/1465//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1465//console This message is automatically generated.
          Hide
          Zhijie Shen added a comment -

          The passed in appAttemptId for an app currently seems to be the same object, such that it can be used to for synchronized blocks, but I agree with the idea of wrapper, because it is more predictable and stand-alone in ApplicationMasterService.

          BTW, is it convenient to write a test case for concurrent allocation? Like TestClientRMService#testConcurrentAppSubmit.

          Show
          Zhijie Shen added a comment - The passed in appAttemptId for an app currently seems to be the same object, such that it can be used to for synchronized blocks, but I agree with the idea of wrapper, because it is more predictable and stand-alone in ApplicationMasterService. BTW, is it convenient to write a test case for concurrent allocation? Like TestClientRMService#testConcurrentAppSubmit.
          Hide
          Omkar Vinit Joshi added a comment -

          BTW, is it convenient to write a test case for concurrent allocation? Like TestClientRMService#testConcurrentAppSubmit.

          yeah wrote one...

          The passed in appAttemptId for an app currently seems to be the same object, such that it can be used to for synchronized blocks, but I agree with the idea of wrapper, because it is more predictable and stand-alone in ApplicationMasterService.

          locking on appAttemptId in case of allocate / RegisterApplicationMaster call won't work. They are coming from client...can't guarantee that they are identical in terms grabbing a lock.. thoughts?

          Show
          Omkar Vinit Joshi added a comment - BTW, is it convenient to write a test case for concurrent allocation? Like TestClientRMService#testConcurrentAppSubmit. yeah wrote one... The passed in appAttemptId for an app currently seems to be the same object, such that it can be used to for synchronized blocks, but I agree with the idea of wrapper, because it is more predictable and stand-alone in ApplicationMasterService. locking on appAttemptId in case of allocate / RegisterApplicationMaster call won't work. They are coming from client...can't guarantee that they are identical in terms grabbing a lock.. thoughts?
          Hide
          Zhijie Shen added a comment -

          locking on appAttemptId in case of allocate / RegisterApplicationMaster call won't work. They are coming from client...can't guarantee that they are identical in terms grabbing a lock.. thoughts?

          I meant that AMRMClient uses the same appAttemptId, but the uniqueness is not guaranteed, so I agreed with the self-contained "locker" - wrapper.

          Show
          Zhijie Shen added a comment - locking on appAttemptId in case of allocate / RegisterApplicationMaster call won't work. They are coming from client...can't guarantee that they are identical in terms grabbing a lock.. thoughts? I meant that AMRMClient uses the same appAttemptId, but the uniqueness is not guaranteed, so I agreed with the self-contained "locker" - wrapper.
          Hide
          Bikas Saha added a comment -

          Why do we need a wrapper?
          We should not be locking on the app attempt id. We should try to find some internal RM object thats unique for the app attempt and lock on that. Also avoid locking the RMAttempImpl object itself since it will block internal async dispatcher.

          Show
          Bikas Saha added a comment - Why do we need a wrapper? We should not be locking on the app attempt id. We should try to find some internal RM object thats unique for the app attempt and lock on that. Also avoid locking the RMAttempImpl object itself since it will block internal async dispatcher.
          Hide
          Bikas Saha added a comment -

          btw. it does not look like this is a practical problem. Until we start seeing a few instances of this happening we should probably lower the priority of this jira. I will do that now. Please change it if you think otherwise. A bug that does not manifest itself is not a bug

          Show
          Bikas Saha added a comment - btw. it does not look like this is a practical problem. Until we start seeing a few instances of this happening we should probably lower the priority of this jira. I will do that now. Please change it if you think otherwise. A bug that does not manifest itself is not a bug
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12592426/YARN-744-20130715.1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1485//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1485//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12592426/YARN-744-20130715.1.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1485//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1485//console This message is automatically generated.
          Hide
          Omkar Vinit Joshi added a comment -

          We should not be locking on the app attempt id.

          I am not locking on appAttemptId... or AppAttemptImpl...didn't understand your question.

          Why do we need a wrapper?

          We don't have any explicit lock for an application attempt...I am creating a wrapped object to avoid maintaining per application attempt lock. Thereby across application attempt response we can lock on specific attempt.

          I think this is important as we may loose container than what were requested...

          Show
          Omkar Vinit Joshi added a comment - We should not be locking on the app attempt id. I am not locking on appAttemptId... or AppAttemptImpl...didn't understand your question. Why do we need a wrapper? We don't have any explicit lock for an application attempt...I am creating a wrapped object to avoid maintaining per application attempt lock. Thereby across application attempt response we can lock on specific attempt. I think this is important as we may loose container than what were requested...
          Hide
          Bikas Saha added a comment -

          Does the same thing apply for ResourceTrackerService too?

          Show
          Bikas Saha added a comment - Does the same thing apply for ResourceTrackerService too?
          Hide
          Omkar Vinit Joshi added a comment -

          Bikas Saha yes... there is similar but different bug though..so Mayank Bansal is fixing it. There we are computing the response and then updating RMNodeImpl asynchronously. If this approach is correct then we can do the similar thing after YARN-245 is in.

          Show
          Omkar Vinit Joshi added a comment - Bikas Saha yes... there is similar but different bug though..so Mayank Bansal is fixing it. There we are computing the response and then updating RMNodeImpl asynchronously. If this approach is correct then we can do the similar thing after YARN-245 is in.
          Hide
          Bikas Saha added a comment -

          Patch needs rebase. Probably reuse the recently added UnknownApplication exception.
          Can we pick a better name?

          AllocateResponseWrapper res

          If the wrapper exists then how can the lastResponse be null?

          +    synchronized (res) {
          +      AllocateResponse lastResponse = res.getAllocateResponse();
          +      if (lastResponse == null) {
          +        LOG.error("AppAttemptId doesnt exist in cache " + appAttemptId);
          +        return resync;
          +      }
          

          I am not quite getting fully what the test is doing. Does it fail without the change?

          On the test code itself.
          Is there another solution to make the test work other than making a method protected only for test override purposes?
          Can you make it package private instead of protected? protected implies connotations for derived classes.
          Can we avoid the long sleep() calls and use wait-notify or countdown latches so that we dont waste time in the test. They also help make the test less flaky because of race conditions.

          +          @Override
          +          protected void authorizeRequest(ApplicationAttemptId appAttemptID)
          +              throws YarnException {
          +            int interval = 10;
          +            count.incrementAndGet();
          +            while (count.get() == 1 && interval-- > 0 ) {
          +              try {
          +                Thread.sleep(1000);
          +              } catch (InterruptedException e) {}
          +            }
          +            Assert.assertTrue(count.get() > 1);
          +          }
          +        };
          
          Show
          Bikas Saha added a comment - Patch needs rebase. Probably reuse the recently added UnknownApplication exception. Can we pick a better name? AllocateResponseWrapper res If the wrapper exists then how can the lastResponse be null? + synchronized (res) { + AllocateResponse lastResponse = res.getAllocateResponse(); + if (lastResponse == null ) { + LOG.error( "AppAttemptId doesnt exist in cache " + appAttemptId); + return resync; + } I am not quite getting fully what the test is doing. Does it fail without the change? On the test code itself. Is there another solution to make the test work other than making a method protected only for test override purposes? Can you make it package private instead of protected? protected implies connotations for derived classes. Can we avoid the long sleep() calls and use wait-notify or countdown latches so that we dont waste time in the test. They also help make the test less flaky because of race conditions. + @Override + protected void authorizeRequest(ApplicationAttemptId appAttemptID) + throws YarnException { + int interval = 10; + count.incrementAndGet(); + while (count.get() == 1 && interval-- > 0 ) { + try { + Thread .sleep(1000); + } catch (InterruptedException e) {} + } + Assert.assertTrue(count.get() > 1); + } + };
          Hide
          Omkar Vinit Joshi added a comment -

          Thanks Bikas Saha ...

          AllocateResponseWrapper res

          how about AllocateResponseLock??

          If the wrapper exists then how can the lastResponse be null?

          you are right ..now we no longer need this removing it.

          yeah the test won't actually be able to simulate the race condition mentioned above. Can't think of any other test. Attaching it without a test.

          Show
          Omkar Vinit Joshi added a comment - Thanks Bikas Saha ... AllocateResponseWrapper res how about AllocateResponseLock?? If the wrapper exists then how can the lastResponse be null? you are right ..now we no longer need this removing it. yeah the test won't actually be able to simulate the race condition mentioned above. Can't think of any other test. Attaching it without a test.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12594461/YARN-744-20130726.1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1591//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1591//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12594461/YARN-744-20130726.1.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1591//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1591//console This message is automatically generated.
          Hide
          Omkar Vinit Joshi added a comment -

          rebasing patch..

          Show
          Omkar Vinit Joshi added a comment - rebasing patch..
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12613497/YARN-744.1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2433//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2433//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12613497/YARN-744.1.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2433//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2433//console This message is automatically generated.
          Hide
          Bikas Saha added a comment -

          Better name?

          +    AllocateResponseLock res = responseMap.get(applicationAttemptId);
          

          reuse throwApplicationAttemptDoesNotExistInCacheException() in registerApplicationMaster()?

          use InvalidApplicationMasterRequestException or a new specific exception instead of generic RPCUtil.throwRemoteException()?

          +  private void throwApplicationAttemptDoesNotExistInCacheException(
          +      ApplicationAttemptId appAttemptId) throws YarnException {
          +    String message = "Application doesn't exist in cache "
          +        + appAttemptId;
          +    LOG.error(message);
          +    throw RPCUtil.getRemoteException(message);
          +  }
          

          The new logic is not the same as the old one. If the app is no longer in the cache then it would send a resync response. Now it will send a regular response instead of a resync response.

          -      // before returning response, verify in sync
          -      AllocateResponse oldResponse =
          -          responseMap.put(appAttemptId, allocateResponse);
          -      if (oldResponse == null) {
          -        // appAttempt got unregistered, remove it back out
          -        responseMap.remove(appAttemptId);
          -        String message = "App Attempt removed from the cache during allocate"
          -            + appAttemptId;
          -        LOG.error(message);
          -        return resync;
          -      }
          -
          +      res.setAllocateResponse(allocateResponse);
                 return allocateResponse;
          
          Show
          Bikas Saha added a comment - Better name? + AllocateResponseLock res = responseMap.get(applicationAttemptId); reuse throwApplicationAttemptDoesNotExistInCacheException() in registerApplicationMaster()? use InvalidApplicationMasterRequestException or a new specific exception instead of generic RPCUtil.throwRemoteException()? + private void throwApplicationAttemptDoesNotExistInCacheException( + ApplicationAttemptId appAttemptId) throws YarnException { + String message = "Application doesn't exist in cache " + + appAttemptId; + LOG.error(message); + throw RPCUtil.getRemoteException(message); + } The new logic is not the same as the old one. If the app is no longer in the cache then it would send a resync response. Now it will send a regular response instead of a resync response. - // before returning response, verify in sync - AllocateResponse oldResponse = - responseMap.put(appAttemptId, allocateResponse); - if (oldResponse == null ) { - // appAttempt got unregistered, remove it back out - responseMap.remove(appAttemptId); - String message = "App Attempt removed from the cache during allocate" - + appAttemptId; - LOG.error(message); - return resync; - } - + res.setAllocateResponse(allocateResponse); return allocateResponse;
          Hide
          Omkar Vinit Joshi added a comment -

          Thanks Bikas Saha addressed your comments. Attaching a new patch.

          Show
          Omkar Vinit Joshi added a comment - Thanks Bikas Saha addressed your comments. Attaching a new patch.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12614693/YARN-744.2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2486//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2486//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614693/YARN-744.2.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. -1 core tests . The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2486//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2486//console This message is automatically generated.
          Hide
          Omkar Vinit Joshi added a comment -

          Test failure is not related to this. Opened ticket YARN-1425 to track this.

          Show
          Omkar Vinit Joshi added a comment - Test failure is not related to this. Opened ticket YARN-1425 to track this.
          Hide
          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #4766 (See https://builds.apache.org/job/Hadoop-trunk-Commit/4766/)
          YARN-744. Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. (Omkar Vinit Joshi via bikas) (bikas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1543707)

          • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
          Show
          Hudson added a comment - SUCCESS: Integrated in Hadoop-trunk-Commit #4766 (See https://builds.apache.org/job/Hadoop-trunk-Commit/4766/ ) YARN-744 . Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. (Omkar Vinit Joshi via bikas) (bikas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1543707 ) /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
          Hide
          Bikas Saha added a comment -

          Thanks! Committed to trunk and branch-2

          Show
          Bikas Saha added a comment - Thanks! Committed to trunk and branch-2
          Hide
          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk #397 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/397/)
          YARN-744. Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. (Omkar Vinit Joshi via bikas) (bikas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1543707)

          • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
          Show
          Hudson added a comment - SUCCESS: Integrated in Hadoop-Yarn-trunk #397 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/397/ ) YARN-744 . Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. (Omkar Vinit Joshi via bikas) (bikas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1543707 ) /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
          Hide
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #1588 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1588/)
          YARN-744. Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. (Omkar Vinit Joshi via bikas) (bikas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1543707)

          • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
          Show
          Hudson added a comment - FAILURE: Integrated in Hadoop-Hdfs-trunk #1588 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1588/ ) YARN-744 . Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. (Omkar Vinit Joshi via bikas) (bikas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1543707 ) /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
          Hide
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #1614 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1614/)
          YARN-744. Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. (Omkar Vinit Joshi via bikas) (bikas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1543707)

          • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
          Show
          Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk #1614 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1614/ ) YARN-744 . Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. (Omkar Vinit Joshi via bikas) (bikas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1543707 ) /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java

            People

            • Assignee:
              Omkar Vinit Joshi
              Reporter:
              Bikas Saha
            • Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development