Hadoop YARN
  1. Hadoop YARN
  2. YARN-744

Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated.

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Target Version/s:

      Description

      Looks like the lock taken in this is broken. It takes a lock on lastResponse object and then puts a new lastResponse object into the map. At this point a new thread entering this function will get a new lastResponse object and will be able to take its lock and enter the critical section. Presumably we want to limit one response per app attempt. So the lock could be taken on the ApplicationAttemptId key of the response map object.

      1. MAPREDUCE-3899-branch-0.23.patch
        9 kB
        Bikas Saha
      2. YARN-744.patch
        3 kB
        Omkar Vinit Joshi
      3. YARN-744-20130711.1.patch
        15 kB
        Omkar Vinit Joshi
      4. YARN-744-20130715.1.patch
        21 kB
        Omkar Vinit Joshi
      5. YARN-744-20130726.1.patch
        14 kB
        Omkar Vinit Joshi
      6. YARN-744.1.patch
        12 kB
        Omkar Vinit Joshi
      7. YARN-744.2.patch
        13 kB
        Omkar Vinit Joshi

        Activity

        Hide
        Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk #1614 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1614/)
        YARN-744. Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. (Omkar Vinit Joshi via bikas) (bikas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1543707)

        • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
        Show
        Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk #1614 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1614/ ) YARN-744 . Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. (Omkar Vinit Joshi via bikas) (bikas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1543707 ) /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
        Hide
        Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk #1588 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1588/)
        YARN-744. Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. (Omkar Vinit Joshi via bikas) (bikas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1543707)

        • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
        Show
        Hudson added a comment - FAILURE: Integrated in Hadoop-Hdfs-trunk #1588 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1588/ ) YARN-744 . Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. (Omkar Vinit Joshi via bikas) (bikas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1543707 ) /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
        Hide
        Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Yarn-trunk #397 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/397/)
        YARN-744. Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. (Omkar Vinit Joshi via bikas) (bikas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1543707)

        • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
        Show
        Hudson added a comment - SUCCESS: Integrated in Hadoop-Yarn-trunk #397 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/397/ ) YARN-744 . Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. (Omkar Vinit Joshi via bikas) (bikas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1543707 ) /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
        Hide
        Bikas Saha added a comment -

        Thanks! Committed to trunk and branch-2

        Show
        Bikas Saha added a comment - Thanks! Committed to trunk and branch-2
        Hide
        Hudson added a comment -

        SUCCESS: Integrated in Hadoop-trunk-Commit #4766 (See https://builds.apache.org/job/Hadoop-trunk-Commit/4766/)
        YARN-744. Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. (Omkar Vinit Joshi via bikas) (bikas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1543707)

        • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
        Show
        Hudson added a comment - SUCCESS: Integrated in Hadoop-trunk-Commit #4766 (See https://builds.apache.org/job/Hadoop-trunk-Commit/4766/ ) YARN-744 . Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. (Omkar Vinit Joshi via bikas) (bikas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1543707 ) /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
        Hide
        Omkar Vinit Joshi added a comment -

        Test failure is not related to this. Opened ticket YARN-1425 to track this.

        Show
        Omkar Vinit Joshi added a comment - Test failure is not related to this. Opened ticket YARN-1425 to track this.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12614693/YARN-744.2.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

        org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2486//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2486//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614693/YARN-744.2.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. -1 core tests . The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2486//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2486//console This message is automatically generated.
        Hide
        Omkar Vinit Joshi added a comment -

        Thanks Bikas Saha addressed your comments. Attaching a new patch.

        Show
        Omkar Vinit Joshi added a comment - Thanks Bikas Saha addressed your comments. Attaching a new patch.
        Hide
        Bikas Saha added a comment -

        Better name?

        +    AllocateResponseLock res = responseMap.get(applicationAttemptId);
        

        reuse throwApplicationAttemptDoesNotExistInCacheException() in registerApplicationMaster()?

        use InvalidApplicationMasterRequestException or a new specific exception instead of generic RPCUtil.throwRemoteException()?

        +  private void throwApplicationAttemptDoesNotExistInCacheException(
        +      ApplicationAttemptId appAttemptId) throws YarnException {
        +    String message = "Application doesn't exist in cache "
        +        + appAttemptId;
        +    LOG.error(message);
        +    throw RPCUtil.getRemoteException(message);
        +  }
        

        The new logic is not the same as the old one. If the app is no longer in the cache then it would send a resync response. Now it will send a regular response instead of a resync response.

        -      // before returning response, verify in sync
        -      AllocateResponse oldResponse =
        -          responseMap.put(appAttemptId, allocateResponse);
        -      if (oldResponse == null) {
        -        // appAttempt got unregistered, remove it back out
        -        responseMap.remove(appAttemptId);
        -        String message = "App Attempt removed from the cache during allocate"
        -            + appAttemptId;
        -        LOG.error(message);
        -        return resync;
        -      }
        -
        +      res.setAllocateResponse(allocateResponse);
               return allocateResponse;
        
        Show
        Bikas Saha added a comment - Better name? + AllocateResponseLock res = responseMap.get(applicationAttemptId); reuse throwApplicationAttemptDoesNotExistInCacheException() in registerApplicationMaster()? use InvalidApplicationMasterRequestException or a new specific exception instead of generic RPCUtil.throwRemoteException()? + private void throwApplicationAttemptDoesNotExistInCacheException( + ApplicationAttemptId appAttemptId) throws YarnException { + String message = "Application doesn't exist in cache " + + appAttemptId; + LOG.error(message); + throw RPCUtil.getRemoteException(message); + } The new logic is not the same as the old one. If the app is no longer in the cache then it would send a resync response. Now it will send a regular response instead of a resync response. - // before returning response, verify in sync - AllocateResponse oldResponse = - responseMap.put(appAttemptId, allocateResponse); - if (oldResponse == null ) { - // appAttempt got unregistered, remove it back out - responseMap.remove(appAttemptId); - String message = "App Attempt removed from the cache during allocate" - + appAttemptId; - LOG.error(message); - return resync; - } - + res.setAllocateResponse(allocateResponse); return allocateResponse;
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12613497/YARN-744.1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2433//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2433//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12613497/YARN-744.1.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2433//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2433//console This message is automatically generated.
        Hide
        Omkar Vinit Joshi added a comment -

        rebasing patch..

        Show
        Omkar Vinit Joshi added a comment - rebasing patch..
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12594461/YARN-744-20130726.1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1591//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1591//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12594461/YARN-744-20130726.1.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1591//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1591//console This message is automatically generated.
        Hide
        Omkar Vinit Joshi added a comment -

        Thanks Bikas Saha ...

        AllocateResponseWrapper res

        how about AllocateResponseLock??

        If the wrapper exists then how can the lastResponse be null?

        you are right ..now we no longer need this removing it.

        yeah the test won't actually be able to simulate the race condition mentioned above. Can't think of any other test. Attaching it without a test.

        Show
        Omkar Vinit Joshi added a comment - Thanks Bikas Saha ... AllocateResponseWrapper res how about AllocateResponseLock?? If the wrapper exists then how can the lastResponse be null? you are right ..now we no longer need this removing it. yeah the test won't actually be able to simulate the race condition mentioned above. Can't think of any other test. Attaching it without a test.
        Hide
        Bikas Saha added a comment -

        Patch needs rebase. Probably reuse the recently added UnknownApplication exception.
        Can we pick a better name?

        AllocateResponseWrapper res

        If the wrapper exists then how can the lastResponse be null?

        +    synchronized (res) {
        +      AllocateResponse lastResponse = res.getAllocateResponse();
        +      if (lastResponse == null) {
        +        LOG.error("AppAttemptId doesnt exist in cache " + appAttemptId);
        +        return resync;
        +      }
        

        I am not quite getting fully what the test is doing. Does it fail without the change?

        On the test code itself.
        Is there another solution to make the test work other than making a method protected only for test override purposes?
        Can you make it package private instead of protected? protected implies connotations for derived classes.
        Can we avoid the long sleep() calls and use wait-notify or countdown latches so that we dont waste time in the test. They also help make the test less flaky because of race conditions.

        +          @Override
        +          protected void authorizeRequest(ApplicationAttemptId appAttemptID)
        +              throws YarnException {
        +            int interval = 10;
        +            count.incrementAndGet();
        +            while (count.get() == 1 && interval-- > 0 ) {
        +              try {
        +                Thread.sleep(1000);
        +              } catch (InterruptedException e) {}
        +            }
        +            Assert.assertTrue(count.get() > 1);
        +          }
        +        };
        
        Show
        Bikas Saha added a comment - Patch needs rebase. Probably reuse the recently added UnknownApplication exception. Can we pick a better name? AllocateResponseWrapper res If the wrapper exists then how can the lastResponse be null? + synchronized (res) { + AllocateResponse lastResponse = res.getAllocateResponse(); + if (lastResponse == null ) { + LOG.error( "AppAttemptId doesnt exist in cache " + appAttemptId); + return resync; + } I am not quite getting fully what the test is doing. Does it fail without the change? On the test code itself. Is there another solution to make the test work other than making a method protected only for test override purposes? Can you make it package private instead of protected? protected implies connotations for derived classes. Can we avoid the long sleep() calls and use wait-notify or countdown latches so that we dont waste time in the test. They also help make the test less flaky because of race conditions. + @Override + protected void authorizeRequest(ApplicationAttemptId appAttemptID) + throws YarnException { + int interval = 10; + count.incrementAndGet(); + while (count.get() == 1 && interval-- > 0 ) { + try { + Thread .sleep(1000); + } catch (InterruptedException e) {} + } + Assert.assertTrue(count.get() > 1); + } + };
        Hide
        Omkar Vinit Joshi added a comment -

        Bikas Saha yes... there is similar but different bug though..so Mayank Bansal is fixing it. There we are computing the response and then updating RMNodeImpl asynchronously. If this approach is correct then we can do the similar thing after YARN-245 is in.

        Show
        Omkar Vinit Joshi added a comment - Bikas Saha yes... there is similar but different bug though..so Mayank Bansal is fixing it. There we are computing the response and then updating RMNodeImpl asynchronously. If this approach is correct then we can do the similar thing after YARN-245 is in.
        Hide
        Bikas Saha added a comment -

        Does the same thing apply for ResourceTrackerService too?

        Show
        Bikas Saha added a comment - Does the same thing apply for ResourceTrackerService too?
        Hide
        Omkar Vinit Joshi added a comment -

        We should not be locking on the app attempt id.

        I am not locking on appAttemptId... or AppAttemptImpl...didn't understand your question.

        Why do we need a wrapper?

        We don't have any explicit lock for an application attempt...I am creating a wrapped object to avoid maintaining per application attempt lock. Thereby across application attempt response we can lock on specific attempt.

        I think this is important as we may loose container than what were requested...

        Show
        Omkar Vinit Joshi added a comment - We should not be locking on the app attempt id. I am not locking on appAttemptId... or AppAttemptImpl...didn't understand your question. Why do we need a wrapper? We don't have any explicit lock for an application attempt...I am creating a wrapped object to avoid maintaining per application attempt lock. Thereby across application attempt response we can lock on specific attempt. I think this is important as we may loose container than what were requested...
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12592426/YARN-744-20130715.1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1485//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1485//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12592426/YARN-744-20130715.1.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1485//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1485//console This message is automatically generated.
        Hide
        Bikas Saha added a comment -

        btw. it does not look like this is a practical problem. Until we start seeing a few instances of this happening we should probably lower the priority of this jira. I will do that now. Please change it if you think otherwise. A bug that does not manifest itself is not a bug

        Show
        Bikas Saha added a comment - btw. it does not look like this is a practical problem. Until we start seeing a few instances of this happening we should probably lower the priority of this jira. I will do that now. Please change it if you think otherwise. A bug that does not manifest itself is not a bug
        Hide
        Bikas Saha added a comment -

        Why do we need a wrapper?
        We should not be locking on the app attempt id. We should try to find some internal RM object thats unique for the app attempt and lock on that. Also avoid locking the RMAttempImpl object itself since it will block internal async dispatcher.

        Show
        Bikas Saha added a comment - Why do we need a wrapper? We should not be locking on the app attempt id. We should try to find some internal RM object thats unique for the app attempt and lock on that. Also avoid locking the RMAttempImpl object itself since it will block internal async dispatcher.
        Hide
        Zhijie Shen added a comment -

        locking on appAttemptId in case of allocate / RegisterApplicationMaster call won't work. They are coming from client...can't guarantee that they are identical in terms grabbing a lock.. thoughts?

        I meant that AMRMClient uses the same appAttemptId, but the uniqueness is not guaranteed, so I agreed with the self-contained "locker" - wrapper.

        Show
        Zhijie Shen added a comment - locking on appAttemptId in case of allocate / RegisterApplicationMaster call won't work. They are coming from client...can't guarantee that they are identical in terms grabbing a lock.. thoughts? I meant that AMRMClient uses the same appAttemptId, but the uniqueness is not guaranteed, so I agreed with the self-contained "locker" - wrapper.
        Hide
        Omkar Vinit Joshi added a comment -

        BTW, is it convenient to write a test case for concurrent allocation? Like TestClientRMService#testConcurrentAppSubmit.

        yeah wrote one...

        The passed in appAttemptId for an app currently seems to be the same object, such that it can be used to for synchronized blocks, but I agree with the idea of wrapper, because it is more predictable and stand-alone in ApplicationMasterService.

        locking on appAttemptId in case of allocate / RegisterApplicationMaster call won't work. They are coming from client...can't guarantee that they are identical in terms grabbing a lock.. thoughts?

        Show
        Omkar Vinit Joshi added a comment - BTW, is it convenient to write a test case for concurrent allocation? Like TestClientRMService#testConcurrentAppSubmit. yeah wrote one... The passed in appAttemptId for an app currently seems to be the same object, such that it can be used to for synchronized blocks, but I agree with the idea of wrapper, because it is more predictable and stand-alone in ApplicationMasterService. locking on appAttemptId in case of allocate / RegisterApplicationMaster call won't work. They are coming from client...can't guarantee that they are identical in terms grabbing a lock.. thoughts?
        Hide
        Zhijie Shen added a comment -

        The passed in appAttemptId for an app currently seems to be the same object, such that it can be used to for synchronized blocks, but I agree with the idea of wrapper, because it is more predictable and stand-alone in ApplicationMasterService.

        BTW, is it convenient to write a test case for concurrent allocation? Like TestClientRMService#testConcurrentAppSubmit.

        Show
        Zhijie Shen added a comment - The passed in appAttemptId for an app currently seems to be the same object, such that it can be used to for synchronized blocks, but I agree with the idea of wrapper, because it is more predictable and stand-alone in ApplicationMasterService. BTW, is it convenient to write a test case for concurrent allocation? Like TestClientRMService#testConcurrentAppSubmit.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12591936/YARN-744-20130711.1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1465//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/1465//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1465//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12591936/YARN-744-20130711.1.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. -1 findbugs . The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1465//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/1465//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1465//console This message is automatically generated.
        Hide
        Omkar Vinit Joshi added a comment -

        Adding wrapper around AllocateResponse so that I don't have to maintain new map for appAttemptId. Synchronizing all accesses on that.

        Show
        Omkar Vinit Joshi added a comment - Adding wrapper around AllocateResponse so that I don't have to maintain new map for appAttemptId. Synchronizing all accesses on that.
        Hide
        Omkar Vinit Joshi added a comment -

        Bikas Saha sounds reasonable ..will take a look at it again.

        Show
        Omkar Vinit Joshi added a comment - Bikas Saha sounds reasonable ..will take a look at it again.
        Hide
        Bikas Saha added a comment -

        The same problem exists in other protocols methods in ApplicationMasterService. Would retrieving the app from the rmcontext and locking on the app be any better?

        Show
        Bikas Saha added a comment - The same problem exists in other protocols methods in ApplicationMasterService. Would retrieving the app from the rmcontext and locking on the app be any better?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12590077/YARN-744.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1404//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1404//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590077/YARN-744.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1404//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1404//console This message is automatically generated.
        Hide
        Omkar Vinit Joshi added a comment -

        The problem here is that we retrieve the last response from resource map and then try to grab a lock on it. However after grabbing lock we don't check if the last response in resource map itself got updated or not. That results into a race condition which I am trying to solve here.. After grabbing the lock an additional check has to be made to ensure that lastResponse was not changed in between i.e. no other AM requests were processed.

        Show
        Omkar Vinit Joshi added a comment - The problem here is that we retrieve the last response from resource map and then try to grab a lock on it. However after grabbing lock we don't check if the last response in resource map itself got updated or not. That results into a race condition which I am trying to solve here.. After grabbing the lock an additional check has to be made to ensure that lastResponse was not changed in between i.e. no other AM requests were processed.
        Hide
        Bikas Saha added a comment -

        Yeah. Those complications have left this jira dead for sometime. I should have cancelled the patch long ago. Sorry about that. If you have time could you look at MAPREDUCE-3921 instead

        Show
        Bikas Saha added a comment - Yeah. Those complications have left this jira dead for sometime. I should have cancelled the patch long ago. Sorry about that. If you have time could you look at MAPREDUCE-3921 instead
        Hide
        Siddharth Seth added a comment -

        Bikas, I'm not sure that the locking isn't needed. It's broken the way it is rightnow though.
        ApplicationMasterService seems to be written to handle one bad request (caused by network issues etc). allocate() does more than a scheduler.allocate - node state changes, newly allocated containers etc. All of that needs to be part of a single response. Without the lock - that may not happen. Locking on the AppAttemptId or the AppId itself is probably a better option, and also needs to cover the retrieval of the last response.
        Also, we shouldn't be logging in the allocate call (not at INFO level anyway). That'll flood the RM logs.

        Show
        Siddharth Seth added a comment - Bikas, I'm not sure that the locking isn't needed. It's broken the way it is rightnow though. ApplicationMasterService seems to be written to handle one bad request (caused by network issues etc). allocate() does more than a scheduler.allocate - node state changes, newly allocated containers etc. All of that needs to be part of a single response. Without the lock - that may not happen. Locking on the AppAttemptId or the AppId itself is probably a better option, and also needs to cover the retrieval of the last response. Also, we shouldn't be logging in the allocate call (not at INFO level anyway). That'll flood the RM logs.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12515864/MAPREDUCE-3899-branch-0.23.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1925//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1925//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12515864/MAPREDUCE-3899-branch-0.23.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1925//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1925//console This message is automatically generated.
        Hide
        Bikas Saha added a comment -

        1) removed the lock as its not needed
        2) added comments for when it might be needed
        3) refactored appAttemptId -> applicationAttemptId to make it consistent across the file
        4) added a couple of logs to make it consistent across the processing
        5) a few Eclipse code style changes

        Show
        Bikas Saha added a comment - 1) removed the lock as its not needed 2) added comments for when it might be needed 3) refactored appAttemptId -> applicationAttemptId to make it consistent across the file 4) added a couple of logs to make it consistent across the processing 5) a few Eclipse code style changes

          People

          • Assignee:
            Omkar Vinit Joshi
            Reporter:
            Bikas Saha
          • Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development