YARN-2964: RM prematurely cancels tokens for jobs that submit jobs (oozie)

    Details

    • Hadoop Flags: Reviewed

      Description

      The RM used to globally track the unique set of tokens for all apps. It remembered the first job that was submitted with the token. The first job controlled the cancellation of the token. This prevented completion of sub-jobs from canceling tokens used by the main job.

      As of YARN-2704, the RM now tracks tokens on a per-app basis. There is no notion of the first/main job. This results in sub-jobs canceling tokens and failing the main job and other sub-jobs. It also appears to schedule multiple redundant renewals.

      The issue is not immediately obvious because the RM will cancel tokens ~10 min (NM liveliness interval) after log aggregation completes. The result is that an oozie job, e.g. pig, that launches many sub-jobs over time will fail if any sub-job is launched >10 min after any sub-job completes. If all other sub-jobs complete within that 10 min window, then the issue goes unnoticed.

      1. YARN-2964.1.patch
        14 kB
        Jian He
      2. YARN-2964.2.patch
        14 kB
        Jian He
      3. YARN-2964.3.patch
        14 kB
        Jian He


          Activity

          daryn Daryn Sharp added a comment -

          Vinod Kumar Vavilapalli, can you take a look at this?

          kasha Karthik Kambatla added a comment -

          Thanks for reporting this, Daryn. Bumping it to a Blocker.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          I checked the code; I doubt there is a bug.

          The first job controlled the cancellation of the token.

          Correct.

          This prevented completion of sub-jobs from canceling tokens used by the main job.

          Only partially true. The more common case to avoid was the completion of the launcher job itself canceling tokens to be used by the sub-jobs.

          As of YARN-2704, the RM now tracks tokens on a per-app basis. There is no notion of the first/main job. This results in sub-jobs canceling tokens and failing the main job and other sub-jobs.

          AFAIR, this code never had the concept of a first job. An app submits tokens into a flat list; every time an app finishes, the RM checks whether the CancelTokensWhenComplete flag is set and skips the cancellation for that app if it is. The token then expires after 7 days. This continues to be the case even after YARN-2704.

          It also appears to schedule multiple redundant renewals.

          Specific references?

          If all other sub-jobs complete within that 10 min window, then the issue goes unnoticed.

          I doubt this issue happens at all. Are you seeing it on a cluster or is it a theory? IAC, Jian He, we can write a test case that proves or disproves this?

          jlowe Jason Lowe added a comment -

          AFAIR, this code never had the concept of a first job. An app submits tokens into a flat list; every time an app finishes, the RM checks whether the CancelTokensWhenComplete flag is set and skips the cancellation for that app if it is.

          As I understand it, the original code implicitly had the concept of a first job because the tokens were stored in a Set instead of a Map. Once the token was stashed in the set, subsequent attempts from sub-jobs to store the token would silently be ignored because the token was already in the set. Since the DelegationTokenToRenew only hashes and checks the underlying token, the difference between shouldCancelAtEnd is ignored and therefore lost when the first job's token is already in the set. In the new code, the DelegationTokenToRenew objects are kept in a map instead of a set, so we are no longer implicitly ignoring the same tokens from sub-jobs as we did in the past. This is what allows a sub-job to "override" the request of the launcher job to avoid canceling the token.
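          For illustration, here is a minimal, self-contained sketch (simplified names, not the actual RM classes) of how equality on the token alone makes a Set keep the launcher's entry while a Map put lets a sub-job clobber it:

          import java.util.*;

          // Simplified stand-in for DelegationTokenToRenew: equality and hashing
          // consider only the underlying token, not the owning application.
          class TokenToRenew {
            final String token;
            final String appId;
            boolean shouldCancelAtEnd;

            TokenToRenew(String token, String appId, boolean cancel) {
              this.token = token; this.appId = appId; this.shouldCancelAtEnd = cancel;
            }
            @Override public boolean equals(Object o) {
              return o instanceof TokenToRenew && token.equals(((TokenToRenew) o).token);
            }
            @Override public int hashCode() { return token.hashCode(); }
          }

          public class SetVsMapDemo {
            public static void main(String[] args) {
              TokenToRenew launcher = new TokenToRenew("tok-1", "app_launcher", false);
              TokenToRenew subJob   = new TokenToRenew("tok-1", "app_sub", true);

              // Old behavior: the Set silently ignores the sub-job's re-add,
              // so the launcher's shouldCancelAtEnd=false entry survives.
              Set<TokenToRenew> set = new HashSet<>();
              set.add(launcher);
              set.add(subJob);                                    // no-op, token already present
              System.out.println(set.iterator().next().appId);    // app_launcher

              // New behavior: a Map keyed by token lets the sub-job clobber
              // the launcher's entry, losing shouldCancelAtEnd=false.
              Map<String, TokenToRenew> map = new HashMap<>();
              map.put(launcher.token, launcher);
              map.put(subJob.token, subJob);                      // overwrites the launcher's entry
              System.out.println(map.get("tok-1").shouldCancelAtEnd); // true, so the token gets canceled
            }
          }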

          Are you seeing it on a cluster or is it a theory?

          This is occurring on our 2.6 clusters. Our 2.5-based clusters do not exhibit the problem.

          jianhe Jian He added a comment -

          the difference between shouldCancelAtEnd is ignored and therefore lost when the first job's token is already in the set.

          One question: who is setting the shouldCancelAtEnd flag? Is it only the main job, or are all sub-jobs setting it?

          jlowe Jason Lowe added a comment -

          One question: who is setting the shouldCancelAtEnd flag? Is it only the main job, or are all sub-jobs setting it?

          AFAIK only the Oozie launcher job is requesting tokens not be canceled at the end of the job. If all of the sub-jobs were also requesting that then we wouldn't see the issue since nobody would cancel the token. I'm not sure all of the sub-jobs in all cases are asking for the token to be canceled at the end of the job, but in the current code it only takes one to spoil it for the others.

          jianhe Jian He added a comment -

          The reason the mapping was introduced is efficiency, so that removing tokens for a single application doesn't need to search all tokens in a global set. Maybe the quickest way to fix this is to change oozie sub-jobs to set this flag.
          Anyway, I can work on a patch to fix this in DelegationTokenRenewer. Thanks for reporting this issue!

          Maybe long-term we should have a group ID for a group of applications so that the token lifetime is tied to a group of applications instead of a single application.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #6736 (See https://builds.apache.org/job/Hadoop-trunk-Commit/6736/)
          YARN-2964. FSLeafQueue#assignContainer - document the reason for using both write and read locks. (Tsuyoshi Ozawa via kasha) (kasha: rev f2d150ea1205b77a75c347ace667b4cd060aaf40)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
          jianhe Jian He added a comment -

          Once the token was stashed in the set, subsequent attempts from sub-jobs to store the token would silently be ignored because the token was already in the set.

          After digging into the code, I found that even if we do not cancel the token when the flag is set, we still remove the token from the global set. This means that if a sub-job doesn't set the flag, the token will be added to the global set again, and once the sub-job finishes the token is canceled. I'm wondering how this worked before. Jason Lowe, Daryn Sharp, could you shed some light on this?

          jlowe Jason Lowe added a comment -

          IIUC it worked in the past because typically the Oozie launcher job hangs around waiting for all the sub-jobs to complete (e.g.: launcher is running a pig client). Since the launcher job was the first to request the token, it's the one that remains in the set. Any attempt to add the token by a sub-job will not actually add it because of the way the hashcode and equals methods on DelegationTokenToRenew work. Therefore when a sub-job completes and it tries to remove the tokens, this token will not match because the app ID is for the launcher and not the sub-job.

          jianhe Jian He added a comment -

          I see, I missed the part that the launcher job will wait for sub-jobs to complete. Thanks for your explanation!

          kasha Karthik Kambatla added a comment -

          IIRC, the launcher job waits for all actions but the MR action. As an optimization, Oozie started exiting the launcher for pure MR actions. Robert Kanter?

          rkanter Robert Kanter added a comment -

          Karthik Kambatla is correct. The launcher job waits around for all action types that typically submit other MR jobs (Pig, Sqoop, Hive, etc.) except for the MapReduce action, which finishes immediately after submitting the "real" MR job.

          I just checked, and in the MR launcher, Oozie sets mapreduce.job.complete.cancel.delegation.tokens to true and in the other launchers, Oozie sets it to false. Oozie doesn't touch this property in any "real" launched MR jobs, so they'll use the default, which I'm guessing is true. Though thinking about this now, it seems like these are backwards, so I'm not sure how that's working right....

          On a related note, we did see an issue recently where a launched job that took over 24 hours would cause the launcher to fail with a delegation token issue because the token expired; even with the property explicitly set correctly. The problem was that yarn.resourcemanager.delegation.token.renew-interval was set to 24 hours (the default) and if you don't renew (or use?) a delegation token at least every 24 hours, then it automatically expires. Daryn Sharp, perhaps in the original issue this was set to 10 minutes? I haven't had a chance to look into this, but the fix for this particular issue would be to have the launcher job renew the token at some interval.
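          As a rough sketch of that workaround (hedged: this is not Oozie code; it assumes the launcher holds the job's delegation Token and a Configuration, and the renewal period is a placeholder):

          import java.util.concurrent.Executors;
          import java.util.concurrent.ScheduledExecutorService;
          import java.util.concurrent.TimeUnit;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.security.token.Token;

          // Hypothetical launcher-side helper: periodically renew a delegation token
          // so it does not sit idle past yarn.resourcemanager.delegation.token.renew-interval
          // (24h by default) while long-running sub-jobs are still active.
          class TokenKeepAlive {
            private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

            void start(final Token<?> token, final Configuration conf, long periodHours) {
              scheduler.scheduleAtFixedRate(new Runnable() {
                @Override public void run() {
                  try {
                    token.renew(conf);   // pushes the token's expiration forward
                  } catch (Exception e) {
                    // real code would log and possibly re-acquire the token
                  }
                }
              }, periodHours, periodHours, TimeUnit.HOURS);
            }

            void stop() { scheduler.shutdownNow(); }
          }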

          jianhe Jian He added a comment -

          we did see an issue recently where a launched job that took over 24 hours would cause the launcher to fail with a delegation token issue because the token expired;

          This is because the token is removed from the RM DelegationTokenRenewer even though the flag is set to false. Hence, the RM won't renew the token. This will cause the oozie job to fail after 24 hrs, which should be an existing issue. I'm working on a patch to fix this so that it is no worse than before. The patch is based on the assumption that the launcher job waits for all actions to complete.

          In addition, I think it may make sense for oozie to propagate this flag to other actions also. Or we could take another approach and have an application group ID to indicate a group of applications, as in the oozie case, tie the token lifetime to the group, and drop this flag completely.

          rkanter Robert Kanter added a comment -

          +1 to the idea of groups. Canceling/not canceling the token the way we do now seems kinda hacky.

          jianhe Jian He added a comment -

          Uploaded a patch:

          • The patch adds a new map which keeps track of all the tokens. If the token is already present, it will not add a new DelegationTokenToRenew instance for that token (see the sketch below).
          • Add a conditional check in the requestNewHdfsDelegationToken method (missed this in YARN-2704).
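          Roughly, the add-only-if-absent idea looks like the following (a simplified sketch, not the actual patch; the DelegationTokenToRenew constructor and the scheduleRenewal helper are placeholders that mirror the discussion):

          // Keyed by the delegation token itself; the first app to submit a token "owns" it.
          Map<Token<?>, DelegationTokenToRenew> allTokens = new HashMap<>();

          void handleAppSubmitEvent(ApplicationId appId, Token<?> token, boolean shouldCancelAtEnd) {
            DelegationTokenToRenew dttr = allTokens.get(token);
            if (dttr == null) {
              // first application to submit this token: register it for renewal
              dttr = new DelegationTokenToRenew(appId, token, shouldCancelAtEnd);
              allTokens.put(token, dttr);
              scheduleRenewal(dttr);
            }
            // an already-tracked token is not re-registered, so a sub-job cannot
            // clobber the launcher's shouldCancelAtEnd=false request
          }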
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12687918/YARN-2964.1.patch
          against trunk revision 1050d42.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 14 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService
          org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps
          org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
          org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService
          org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6140//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6140//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6140//console

          This message is automatically generated.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12687918/YARN-2964.1.patch
          against trunk revision 1050d42.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 14 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation
          org.apache.hadoop.yarn.server.resourcemanager.TestRM

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6142//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6142//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6142//console

          This message is automatically generated.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #45 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/45/)
          YARN-2964. FSLeafQueue#assignContainer - document the reason for using both write and read locks. (Tsuyoshi Ozawa via kasha) (kasha: rev f2d150ea1205b77a75c347ace667b4cd060aaf40)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #779 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/779/)
          YARN-2964. FSLeafQueue#assignContainer - document the reason for using both write and read locks. (Tsuyoshi Ozawa via kasha) (kasha: rev f2d150ea1205b77a75c347ace667b4cd060aaf40)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #42 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/42/)
          YARN-2964. FSLeafQueue#assignContainer - document the reason for using both write and read locks. (Tsuyoshi Ozawa via kasha) (kasha: rev f2d150ea1205b77a75c347ace667b4cd060aaf40)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #1977 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1977/)
          YARN-2964. FSLeafQueue#assignContainer - document the reason for using both write and read locks. (Tsuyoshi Ozawa via kasha) (kasha: rev f2d150ea1205b77a75c347ace667b4cd060aaf40)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #46 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/46/)
          YARN-2964. FSLeafQueue#assignContainer - document the reason for using both write and read locks. (Tsuyoshi Ozawa via kasha) (kasha: rev f2d150ea1205b77a75c347ace667b4cd060aaf40)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #1996 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1996/)
          YARN-2964. FSLeafQueue#assignContainer - document the reason for using both write and read locks. (Tsuyoshi Ozawa via kasha) (kasha: rev f2d150ea1205b77a75c347ace667b4cd060aaf40)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
          jlowe Jason Lowe added a comment -

          Thanks for the patch, Jian! Findbug warnings appear to be unrelated.

          I'm wondering about the change in the removeApplicationFromRenewal method or remove. If a sub-job completes, won't we remove the token from the allTokens map before the launcher job has completed? Then a subsequent sub-job that requests token cancellation can put the token back in the map and cause the token to be canceled when it leaves. I think we need to repeat the logic from the original code before YARN-2704 here, i.e.: only remove the token if the application ID matches. That way the launcher job's token will remain the token in that collection until the launcher job completes.

          This comment doesn't match the code, since the code looks like if any token wants to cancel at the end then we will cancel at the end.

                    // If any of the jobs sharing the same token set shouldCancelAtEnd
                    // to true, we should not cancel the token.
                    if (evt.shouldCancelAtEnd) {
                      dttr.shouldCancelAtEnd = evt.shouldCancelAtEnd;
                    }
          

          I think the logic and comment should be if any job doesn't want to cancel then we won't cancel. The code seems to be trying to do the opposite, so I'm not sure how the unit test is passing. Maybe I'm missing something.
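          A fragment with the semantics described above might look like this (a sketch mirroring the quoted snippet, not the committed code):

          // If any of the jobs sharing the same token asked for the token
          // NOT to be canceled at the end, never cancel it.
          if (!evt.shouldCancelAtEnd) {
            dttr.shouldCancelAtEnd = false;
          }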

          The info log message added in handleAppSubmitEvent also is misleading, as it says we are setting shouldCancelAtEnd to whatever the event said, when in reality we only set it sometimes. Probably needs to be inside the conditional.

          Wonder if we should be using a Set instead of a Map to track these tokens. Adding an already existing DelegationTokenToRenew to a set will not change the one already there, but with the map a sub-job can clobber the DelegationTokenToRenew that's already there with its own when it does the allTokens.put(dtr.token, dtr).

          jianhe Jian He added a comment -

          thanks for your comments, Jason !

          I'm wondering about the change in the removeApplicationFromRenewal method or remove.

          If the launcher job's token first gets added to the appTokens map, DelegationTokenRenewer will not add a DelegationTokenToRenew instance for the sub-job. So the tokens returned by removeApplicationFromRenewal will be empty for the sub-job when the sub-job completes, and the token won't be removed from allTokens. My only concern with a global set is that each time an application completes, we end up looping over all the applications or worse (each app may have at least one token).

          This comment doesn't match the code

          Good catch... what a mistake. I might have been under the impression that the semantics were "shouldKeepAtEnd". I added one line in the test case to guard against this.

          Wonder if we should be using a Set instead of a Map to track these tokens

          I thought about that too; the reason for switching to a map is to be able to get the DelegationTokenToRenew instance based on the token the app provided and change the shouldCancelAtEnd field on submission.

          jianhe Jian He added a comment -

          Updated the patch based on some comments from Jason.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12688092/YARN-2964.2.patch
          against trunk revision 07619aa.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 14 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.server.resourcemanager.TestRM
          org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService
          org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6149//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6149//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6149//console

          This message is automatically generated.

          jlowe Jason Lowe added a comment -

          If the launcher job's token first gets added to the appTokens map, DelegationTokenRenewer will not add a DelegationTokenToRenew instance for the sub-job.

          Ah, sorry, I missed this critical change from the original patch. However, if we don't add the delegation token for each sub-job, then I think we have a problem with the following use case:

          1. Oozie launcher submits a MapReduce sub-job
          2. MapReduce job starts
          3. Oozie launcher job leaves
          4. MapReduce job now running with a token that the RM has "forgotten" and won't be automatically renewed

          We might have had the same issue in this case prior to YARN-2704, since the token would be pulled from the set when the launcher completed.

          jianhe Jian He added a comment -

          We might have had the same issue in this case prior to YARN-2704.

          Yes, this is an existing issue. As Robert pointed out in the previous comment, an oozie MapReduce sub-job now cannot run beyond 24 hrs. IMO, we can fix this separately?

          jlowe Jason Lowe added a comment -

          Sure, we can fix that as a followup issue since it's no worse than what we had before.

          +1, lgtm. The only nit is that the new getAllTokens method should be package-private instead of public, but it's not a big deal either way. I assume the test failures are unrelated?

          jianhe Jian He added a comment -

          I believe the failures are not related. I just changed the visibility and uploaded a new patch to re-kick jenkins.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12688133/YARN-2964.3.patch
          against trunk revision b9d4976.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 14 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
          org.apache.hadoop.yarn.server.resourcemanager.TestRM

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6150//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6150//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6150//console

          This message is automatically generated.

          jlowe Jason Lowe added a comment -

          +1 lgtm. I don't believe the test failures are related since they pass for me locally. Committing this.

          jlowe Jason Lowe added a comment -

          Thanks, Jian! I committed this to trunk and branch-2.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #6755 (See https://builds.apache.org/job/Hadoop-trunk-Commit/6755/)
          YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
          jianhe Jian He added a comment -

          Thanks for reviewing and committing, Jason!

          rkanter Robert Kanter added a comment -

          Thanks for fixing this.

          Jian He, Jason Lowe, on the >24 hrs thing, do you think this is something we can/should fix in YARN? My understanding of this issue is that it's by design (there's even a config for the interval). Given that, I'm thinking the proper fix for this is just to have the launcher job periodically renew the token (a fix in OOZIE)?

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #46 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/46/)
          YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #780 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/780/)
          YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #1978 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1978/)
          YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #43 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/43/)
          YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
          • hadoop-yarn-project/CHANGES.txt
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #47 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/47/)
          YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #1997 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1997/)
          YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie). Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
          Hide
          jianhe Jian He added a comment -

          do you think this is something we can/should fix in YARN?

          I think so. The RM is the designated renewer, so it should renew the token periodically. But because of a bug in DelegationTokenRenewer, the RM simply forgets the token and won't renew it automatically. We should fix DelegationTokenRenewer so that it keeps tracking the token and renews it properly.
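          To illustrate the idea only — this is a minimal hypothetical sketch, not the actual DelegationTokenRenewer code; SharedTokenTracker and the renewToken/cancelToken hooks are made-up names — the renewer could reference-count the apps sharing a token, keep a single renewal task per token, and cancel only when the last app finishes and cancellation was requested:

{code:java}
// Hypothetical illustration only -- SharedTokenTracker, renewToken() and
// cancelToken() are made-up names, not the real DelegationTokenRenewer API.
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class SharedTokenTracker {

  /** Apps currently referencing each token (token id -> app ids). */
  private final Map<String, Set<String>> appsPerToken = new ConcurrentHashMap<>();
  /** One renewal task per token, regardless of how many apps share it. */
  private final Map<String, ScheduledFuture<?>> renewalTasks = new ConcurrentHashMap<>();
  private final ScheduledExecutorService renewalPool =
      Executors.newSingleThreadScheduledExecutor();

  /** Called when an app is submitted with a delegation token. */
  public synchronized void addApplication(String appId, String tokenId,
      long renewIntervalMs) {
    appsPerToken.computeIfAbsent(tokenId, t -> ConcurrentHashMap.newKeySet()).add(appId);
    // Schedule renewal once per token, not once per app, to avoid the
    // redundant renewals seen when many sub-jobs share the same token.
    renewalTasks.computeIfAbsent(tokenId, t ->
        renewalPool.scheduleAtFixedRate(() -> renewToken(t),
            renewIntervalMs, renewIntervalMs, TimeUnit.MILLISECONDS));
  }

  /** Called when an app finishes. */
  public synchronized void applicationFinished(String appId, String tokenId,
      boolean cancelTokensWhenComplete) {
    Set<String> apps = appsPerToken.get(tokenId);
    if (apps == null) {
      return;
    }
    apps.remove(appId);
    // Cancel only when no app references the token any more AND the submitter
    // did not ask to keep it alive (oozie launcher jobs set that flag to false).
    if (apps.isEmpty() && cancelTokensWhenComplete) {
      appsPerToken.remove(tokenId);
      ScheduledFuture<?> task = renewalTasks.remove(tokenId);
      if (task != null) {
        task.cancel(false);
      }
      cancelToken(tokenId);
    }
    // Otherwise the token stays tracked and the renewal task keeps running,
    // so a sub-job launched later with the same token still finds it valid.
  }

  private void renewToken(String tokenId)  { /* invoke the token's renewer */ }

  private void cancelToken(String tokenId) { /* invoke the token's canceller */ }
}
{code}

          In this shape a finishing sub-job only drops its own reference; the shared token stays tracked and keeps being renewed, so sub-jobs submitted later with the same token still find it valid.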

          Hide
          hitliuyi Yi Liu added a comment -

          It seems this JIRA causes the token not to be renewed properly when it's shared by jobs (oozie). I filed YARN-3055, please take a look.

          Hide
          vinodkv Vinod Kumar Vavilapalli added a comment -

          Pulled this into 2.6.1. Ran compilation and TestDelegationTokenRenewer before the push. Patch applied cleanly.


            People

            • Assignee:
              jianhe Jian He
              Reporter:
              daryn Daryn Sharp
            • Votes:
               0
              Watchers:
               17
