Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.2
    • Component/s: mrv2, nodemanager
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Target Version/s:

      Description

      RM sends a reboot command to NM in some cases, like when it gets lost and rejoins back. In such a case, NM should act on the command and reboot/reinitalize itself.

      This is akin to TT reinitialize on order from JT. We will need to shutdown all the services properly and reinitialize - this should automatically take care of killing of containers, cleaning up local temporary files etc.

      1. MR-3034.txt
        15 kB
        Robert Joseph Evans
      2. MAPREDUCE-3034.patch
        6 kB
        Devaraj K
      3. MAPREDUCE-3034-1.patch
        8 kB
        Devaraj K
      4. MAPREDUCE-3034-2.patch
        8 kB
        Devaraj K
      5. MAPREDUCE-3034-3.patch
        8 kB
        Devaraj K
      6. MAPREDUCE-3034-4.patch
        8 kB
        Devaraj K
      7. MAPREDUCE-3034-5.patch
        8 kB
        Eric Payne
      8. MAPREDUCE-3034-20120305.txt
        8 kB
        Vinod Kumar Vavilapalli
      9. MAPREDUCE-3034-20120305.1.txt
        9 kB
        Vinod Kumar Vavilapalli

        Issue Links

          Activity

          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1011 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1011/)
          MAPREDUCE-3034. Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297310)

          Result = FAILURE
          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297310
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #1011 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1011/ ) MAPREDUCE-3034 . Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297310) Result = FAILURE acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297310 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-0.23-Build #217 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/217/)
          Merge -c 1297310 from trunk to branch-0.23 to fix MAPREDUCE-3034. Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297311)

          Result = FAILURE
          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297311
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-0.23-Build #217 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/217/ ) Merge -c 1297310 from trunk to branch-0.23 to fix MAPREDUCE-3034 . Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297311) Result = FAILURE acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297311 Files : /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #976 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/976/)
          MAPREDUCE-3034. Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297310)

          Result = SUCCESS
          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297310
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #976 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/976/ ) MAPREDUCE-3034 . Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297310) Result = SUCCESS acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297310 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Build #189 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/189/)
          Merge -c 1297310 from trunk to branch-0.23 to fix MAPREDUCE-3034. Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297311)

          Result = SUCCESS
          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297311
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-0.23-Build #189 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/189/ ) Merge -c 1297310 from trunk to branch-0.23 to fix MAPREDUCE-3034 . Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297311) Result = SUCCESS acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297311 Files : /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Common-0.23-Commit #643 (See https://builds.apache.org/job/Hadoop-Common-0.23-Commit/643/)
          Merge -c 1297310 from trunk to branch-0.23 to fix MAPREDUCE-3034. Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297311)

          Result = SUCCESS
          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297311
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Show
          Hudson added a comment - Integrated in Hadoop-Common-0.23-Commit #643 (See https://builds.apache.org/job/Hadoop-Common-0.23-Commit/643/ ) Merge -c 1297310 from trunk to branch-0.23 to fix MAPREDUCE-3034 . Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297311) Result = SUCCESS acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297311 Files : /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #1839 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1839/)
          MAPREDUCE-3034. Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297310)

          Result = SUCCESS
          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297310
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Show
          Hudson added a comment - Integrated in Hadoop-Common-trunk-Commit #1839 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1839/ ) MAPREDUCE-3034 . Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297310) Result = SUCCESS acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297310 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Commit #633 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/633/)
          Merge -c 1297310 from trunk to branch-0.23 to fix MAPREDUCE-3034. Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297311)

          Result = SUCCESS
          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297311
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-0.23-Commit #633 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/633/ ) Merge -c 1297310 from trunk to branch-0.23 to fix MAPREDUCE-3034 . Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297311) Result = SUCCESS acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297311 Files : /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #1913 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1913/)
          MAPREDUCE-3034. Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297310)

          Result = SUCCESS
          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297310
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk-Commit #1913 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1913/ ) MAPREDUCE-3034 . Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297310) Result = SUCCESS acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297310 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-0.23-Commit #644 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/644/)
          Merge -c 1297310 from trunk to branch-0.23 to fix MAPREDUCE-3034. Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297311)

          Result = FAILURE
          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297311
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-0.23-Commit #644 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/644/ ) Merge -c 1297310 from trunk to branch-0.23 to fix MAPREDUCE-3034 . Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297311) Result = FAILURE acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297311 Files : /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #1846 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1846/)
          MAPREDUCE-3034. Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297310)

          Result = FAILURE
          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297310
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #1846 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1846/ ) MAPREDUCE-3034 . Ensure NodeManager reboots itself on direction from ResourceManager. Contributed by Devaraj K & Eric Payne. (Revision 1297310) Result = FAILURE acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1297310 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/dev-support/findbugs-exclude.xml /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Hide
          Arun C Murthy added a comment -

          I just committed this. Thanks Devaraj and Eric!

          Show
          Arun C Murthy added a comment - I just committed this. Thanks Devaraj and Eric!
          Hide
          Arun C Murthy added a comment -

          +1, thanks for the update Vinod.

          The MR1 test failures are unrelated and seem like a temp. blip... all are of the form:

          org.apache.hadoop.util.Shell$ExitCodeException: chmod: cannot access `/tmp/hadoop-jenkins/mapred/staging/jenkins755067366/.staging/job_local_0001': No such file or directory
          
          
          Show
          Arun C Murthy added a comment - +1, thanks for the update Vinod. The MR1 test failures are unrelated and seem like a temp. blip... all are of the form: org.apache.hadoop.util.Shell$ExitCodeException: chmod: cannot access `/tmp/hadoop-jenkins/mapred/staging/jenkins755067366/.staging/job_local_0001': No such file or directory
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12517152/MAPREDUCE-3034-20120305.1.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.mapred.TestJobClientGetJob
          org.apache.hadoop.mapred.TestMRWithDistributedCache
          org.apache.hadoop.mapred.TestLocalModeWithNewApis

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2006//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2006//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12517152/MAPREDUCE-3034-20120305.1.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.mapred.TestJobClientGetJob org.apache.hadoop.mapred.TestMRWithDistributedCache org.apache.hadoop.mapred.TestLocalModeWithNewApis +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2006//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2006//console This message is automatically generated.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Suppressing findBugs warning about sys.exit...

          Show
          Vinod Kumar Vavilapalli added a comment - Suppressing findBugs warning about sys.exit...
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12517129/MAPREDUCE-3034-20120305.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2004//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2004//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2004//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12517129/MAPREDUCE-3034-20120305.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2004//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2004//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2004//console This message is automatically generated.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Taking liberty and making changes myself, this issue has been in a bit of development hell.

          Will push this in when Jenkins returns.

          Show
          Vinod Kumar Vavilapalli added a comment - Taking liberty and making changes myself, this issue has been in a bit of development hell. Will push this in when Jenkins returns.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Looked through the patch, fine overall.

          Making trivial edit like adding comments to the code, renaming a few fields etc. Also we don't need to make changes to CompositeService, we just need to remove the previous shut-down hook, doing this too.

          Also the fact that we are creating a new NodeManager underneath when RM asks to reboot is a problem. This issue is because we cannot reinit a Service, will file a separate ticket.

          Show
          Vinod Kumar Vavilapalli added a comment - Looked through the patch, fine overall. Making trivial edit like adding comments to the code, renaming a few fields etc. Also we don't need to make changes to CompositeService, we just need to remove the previous shut-down hook, doing this too. Also the fact that we are creating a new NodeManager underneath when RM asks to reboot is a problem. This issue is because we cannot reinit a Service, will file a separate ticket.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Looking at the patch for final review/commit..

          Show
          Vinod Kumar Vavilapalli added a comment - Looking at the patch for final review/commit..
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12516844/MAPREDUCE-3034-5.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1984//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1984//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12516844/MAPREDUCE-3034-5.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1984//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1984//console This message is automatically generated.
          Hide
          Eric Payne added a comment -

          The MAPREDUCE-3034-5.patch applies to trunk as well.

          Show
          Eric Payne added a comment - The MAPREDUCE-3034 -5.patch applies to trunk as well.
          Hide
          Eric Payne added a comment -

          I upmerged to the latest 0.23.2 branch.

          I ran several "before and after" tests to be sure that job times did not increase
          nor did heap usage increase for NM or client when comparing before RM stop/start
          vs. after RM stop/start.

          Show
          Eric Payne added a comment - I upmerged to the latest 0.23.2 branch. I ran several "before and after" tests to be sure that job times did not increase nor did heap usage increase for NM or client when comparing before RM stop/start vs. after RM stop/start.
          Hide
          Eric Payne added a comment -

          Upmerging to latest 0.23.2 branch.

          Show
          Eric Payne added a comment - Upmerging to latest 0.23.2 branch.
          Hide
          Eric Payne added a comment -

          @Devaraj,

          I have upmerged the patch to branch-0.23.2 and re-tested on a secure 10-node cluster.

          Should I go ahead and upload the umperged patch?

          Thanks,
          -Eric

          Show
          Eric Payne added a comment - @Devaraj, I have upmerged the patch to branch-0.23.2 and re-tested on a secure 10-node cluster. Should I go ahead and upload the umperged patch? Thanks, -Eric
          Hide
          Eric Payne added a comment -

          @Devaraj,

          Can you please upmerge the patch to the latest code in http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.23

          Thanks!

          Show
          Eric Payne added a comment - @Devaraj, Can you please upmerge the patch to the latest code in http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.23 Thanks!
          Hide
          Eric Payne added a comment -

          @Arun,

          Would it be possible to merge this int 0.23.2?

          Thanks

          Show
          Eric Payne added a comment - @Arun, Would it be possible to merge this int 0.23.2? Thanks
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12513943/MAPREDUCE-3034-4.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1828//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1828//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12513943/MAPREDUCE-3034-4.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1828//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1828//console This message is automatically generated.
          Hide
          Devaraj K added a comment -

          Thanks a lot Arun for looking into the patch.

          1.set/get on isRebooted needs to be synchronized

          I updated the patch with this change.

          2.On reboot, we should be kill existing containers, if any?

          I tested with the patch with/without having running containers in the NM. If any containers are running, it will stop all those containers as part of NM service stop.

          Thank you so much Eric for verifying and describing all the cases.

          Show
          Devaraj K added a comment - Thanks a lot Arun for looking into the patch. 1.set/get on isRebooted needs to be synchronized I updated the patch with this change. 2.On reboot, we should be kill existing containers, if any? I tested with the patch with/without having running containers in the NM. If any containers are running, it will stop all those containers as part of NM service stop. Thank you so much Eric for verifying and describing all the cases.
          Hide
          Eric Payne added a comment -

          @Arun,

          I'm pretty sure that the NodeStatusUpdaterImpl.stop() hierarchy already stops the AppMaster and Containers on the NM via the AsyncDispatcher event process. I was able to verify this by examining the code, running tests, and examining the logs.

          1. Verified by examining the code:
            • When the reboot command comes from the RM to the NM, NodeStatusUpdaterImpl.reboot() sets the isRebooted flag and calls NodeStatusUpdaterImpl.stop().
            • NodeStatusUpdaterImpl.stop() (eventually) calls both AbstractService.changeState() and CompositeService.stop(int numOfServicesStarted). These methods loop through the list of services registered with them and stop each one.
          2. Verified by running tests:
            • With this change implemented, I started a long-running mapred job and then stopped and restarted the RM.
            • During the interval between stopping and restarting the RM, I took a snapshot of the java processes running.
            • Also, during the interval between stopping and restarting the RM, I searched the NM and container logs for messages from the AsyncDispatcher to determine if any services were stopped. None were.
            • After restarting the RM, I took another snapshot of the java processes. An examination indicated that prior to starting the RM, the long-running mapred job was still running with the MRAppMaster and the container running in YarnChild. After the RM started again, the MRAppMaster and YarnChild processes were gone.
          3. Verified by examining logs:
            • After running the above test, I did another search through the NM and container logs and found several services that had been stopped via the AsyncDispatcher event process. Specifically of interest, the ones from the container syslog file were these:
              • JobHistoryEventHandler
              • ContainerLauncherImpl
              • MRAppMaster$ContainerLauncherRouter
              • RMCommunicator
              • MRAppMaster$ContainerAllocatorRouter
              • MRClientService
              • TaskCleaner
              • TaskHeartbeatHandler
              • TaskAttemptListenerImpl
              • Dispatcher
              • MRAppMaster
          Show
          Eric Payne added a comment - @Arun, I'm pretty sure that the NodeStatusUpdaterImpl.stop() hierarchy already stops the AppMaster and Containers on the NM via the AsyncDispatcher event process. I was able to verify this by examining the code, running tests, and examining the logs. Verified by examining the code: When the reboot command comes from the RM to the NM, NodeStatusUpdaterImpl.reboot() sets the isRebooted flag and calls NodeStatusUpdaterImpl.stop(). NodeStatusUpdaterImpl.stop() (eventually) calls both AbstractService.changeState() and CompositeService.stop(int numOfServicesStarted). These methods loop through the list of services registered with them and stop each one. Verified by running tests: With this change implemented, I started a long-running mapred job and then stopped and restarted the RM. During the interval between stopping and restarting the RM, I took a snapshot of the java processes running. Also, during the interval between stopping and restarting the RM, I searched the NM and container logs for messages from the AsyncDispatcher to determine if any services were stopped. None were. After restarting the RM, I took another snapshot of the java processes. An examination indicated that prior to starting the RM, the long-running mapred job was still running with the MRAppMaster and the container running in YarnChild. After the RM started again, the MRAppMaster and YarnChild processes were gone. Verified by examining logs: After running the above test, I did another search through the NM and container logs and found several services that had been stopped via the AsyncDispatcher event process. Specifically of interest, the ones from the container syslog file were these: JobHistoryEventHandler ContainerLauncherImpl MRAppMaster$ContainerLauncherRouter RMCommunicator MRAppMaster$ContainerAllocatorRouter MRClientService TaskCleaner TaskHeartbeatHandler TaskAttemptListenerImpl Dispatcher MRAppMaster
          Hide
          Arun C Murthy added a comment -

          Sorry it took me so long. This is close, couple of nits:

          1. set/get on isRebooted needs to be synchronized
          2. On reboot, we should be kill existing containers, if any?
          Show
          Arun C Murthy added a comment - Sorry it took me so long. This is close, couple of nits: set/get on isRebooted needs to be synchronized On reboot, we should be kill existing containers, if any?
          Hide
          Devaraj K added a comment -


          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.mapred.TestJobCounters
          

          This test failure is not related to this patch.MAPREDUCE-3822 was already created by Mahadev to address this.

          Show
          Devaraj K added a comment - -1 core tests. The patch failed these unit tests: org.apache.hadoop.mapred.TestJobCounters This test failure is not related to this patch. MAPREDUCE-3822 was already created by Mahadev to address this.
          Hide
          Robert Joseph Evans added a comment -

          The patch looks good to me too. But it is a complex enough change that I don't feel comfortable committing it myself, I think Arun or Mahadev should look at the patch.

          Also could you look at the TestJobCounters test failure. It looks like the failure is sporadic, and not related to this patch, but I could not find an existing JIRA for it. Could you at least file a new JIRA for it?

          Show
          Robert Joseph Evans added a comment - The patch looks good to me too. But it is a complex enough change that I don't feel comfortable committing it myself, I think Arun or Mahadev should look at the patch. Also could you look at the TestJobCounters test failure. It looks like the failure is sporadic, and not related to this patch, but I could not find an existing JIRA for it. Could you at least file a new JIRA for it?
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12513433/MAPREDUCE-3034-3.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.mapred.TestJobCounters

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1787//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1787//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12513433/MAPREDUCE-3034-3.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.mapred.TestJobCounters +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1787//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1787//console This message is automatically generated.
          Hide
          Eric Payne added a comment -

          +1 (non-binding)

          I approve the changes introduced in MAPREDUCE-3034-3.patch

          Show
          Eric Payne added a comment - +1 (non-binding) I approve the changes introduced in MAPREDUCE-3034 -3.patch
          Hide
          Devaraj K added a comment -

          Thanks a lot Robert for taking look at the patch.

          I have updated the patch with the above comments fixed.

          Show
          Devaraj K added a comment - Thanks a lot Robert for taking look at the patch. I have updated the patch with the above comments fixed.
          Hide
          Robert Joseph Evans added a comment -

          I looked through the patch and I have a few comments.

          Overall the patch looks very good.

          In TestNodeStatusUpdater.java there are some tests that wait for a specific state

          while(null == rebootedNodeManager){
            Thread.sleep(1000);
          }
          

          In later parts of the test you log that you are waiting and have an upper limit on how long you wait for. Please add in at least an upper limit on how long this loop will wait. I don't want to have a test with a live lock issue if someone breaks it so that state is never reached.

          And a few minor nits: I don't really like the name initNStartNodeManager, I would prefer to have it initAndStartNodeManager. Also in that method I personally think that "if (!isRebooted)" is more readable then "if (false == isRebooted)"

          Show
          Robert Joseph Evans added a comment - I looked through the patch and I have a few comments. Overall the patch looks very good. In TestNodeStatusUpdater.java there are some tests that wait for a specific state while ( null == rebootedNodeManager){ Thread .sleep(1000); } In later parts of the test you log that you are waiting and have an upper limit on how long you wait for. Please add in at least an upper limit on how long this loop will wait. I don't want to have a test with a live lock issue if someone breaks it so that state is never reached. And a few minor nits: I don't really like the name initNStartNodeManager, I would prefer to have it initAndStartNodeManager. Also in that method I personally think that "if (!isRebooted)" is more readable then "if (false == isRebooted)"
          Hide
          Robert Joseph Evans added a comment -

          The javadoc warnings look like they are fixed by HADOOP-8018.

          Show
          Robert Joseph Evans added a comment - The javadoc warnings look like they are fixed by HADOOP-8018 .
          Hide
          Devaraj K added a comment -
          -1 javadoc. The javadoc tool appears to have generated 2 warning messages.
          

          These javadoc warnings are not related to this patch.

          Show
          Devaraj K added a comment - -1 javadoc. The javadoc tool appears to have generated 2 warning messages. These javadoc warnings are not related to this patch.
          Hide
          Eric Payne added a comment -

          @Mahadev,

          Can you please take a look at this patch and commit if it looks good?

          I have already done quite a bit of testing on it in both one-node simple cluster and 10-node secure cluster.

          Thanks,
          -Eric

          Show
          Eric Payne added a comment - @Mahadev, Can you please take a look at this patch and commit if it looks good? I have already done quite a bit of testing on it in both one-node simple cluster and 10-node secure cluster. Thanks, -Eric
          Hide
          Eric Payne added a comment -

          +1 (non-binding)

          Thanks Devaraj.

          The patch looks good.

          Show
          Eric Payne added a comment - +1 (non-binding) Thanks Devaraj. The patch looks good.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12512955/MAPREDUCE-3034-2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 2 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1747//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1747//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12512955/MAPREDUCE-3034-2.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 javadoc. The javadoc tool appears to have generated 2 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1747//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1747//console This message is automatically generated.
          Hide
          Devaraj K added a comment -

          Thanks Eric for revewing and testing the patch.

          I have updated the patch with the suggested change.

          Show
          Devaraj K added a comment - Thanks Eric for revewing and testing the patch. I have updated the patch with the suggested change.
          Hide
          Eric Payne added a comment -

          Hi Devaraj,

          Thanks for updating the patch.

          It looks good to me with one exeption. The more I think about it, the more I think we should re-read the namenode configs when the NM restarts. My reason for this opinion is that, in this particular use case, the RM will have a new version of the configs, and you usually want the RMs configs and the NMs configs to match.

          Other than that, I am happy with the patch. I downloaded it and tested it in both a one-node simple cluster and in a 10-node security cluster. I restarted the RM several times and checked the heap dump to look for memory leaks, and I didn't see any. I also ran about 100 wordcount tests after restarting the RM/NMs.

          Show
          Eric Payne added a comment - Hi Devaraj, Thanks for updating the patch. It looks good to me with one exeption. The more I think about it, the more I think we should re-read the namenode configs when the NM restarts. My reason for this opinion is that, in this particular use case, the RM will have a new version of the configs, and you usually want the RMs configs and the NMs configs to match. Other than that, I am happy with the patch. I downloaded it and tested it in both a one-node simple cluster and in a 10-node security cluster. I restarted the RM several times and checked the heap dump to look for memory leaks, and I didn't see any. I also ran about 100 wordcount tests after restarting the RM/NMs.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12512746/MAPREDUCE-3034-1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1735//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1735//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12512746/MAPREDUCE-3034-1.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1735//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1735//console This message is automatically generated.
          Hide
          Devaraj K added a comment -

          I updated the patch by incorporating the above comments.

          Show
          Devaraj K added a comment - I updated the patch by incorporating the above comments.
          Hide
          Devaraj K added a comment -
          • Major:
            • NodeManager.stateChanged():
              • In the case of STATE.STOPPED both stop() and reboot() need to be called.
              • I don't see anywhere in the call hierarchy where NodeManager.stop() is ever called in the reboot case.
              • Following the logic through, it looks like this is the call sequence from your patch:
                • NodeStatusUpdaterImpl.startStatusUpdater().run()
                  • calls: NodeStatusUpdaterImpl.reboot()
                  • calls: AbstractService.stop()
                  • calls: AbstractService.changeState()
                  • calls: NodeManager.stateChanged()
                  • Which does not call NodeManager.stop() if NodeStatusUpdaterImpl.isRebooted is set.

          stop() was calling from reboot method. Anyway now I have moved it to common for SHUTDOWN and REBOOT cases.

            • NodeManager:
              • In order for the new private static variable nodeManager to be fully effective, you should take out the local NodeManager declaration within main(). Otherwise, the new getNodeManager() method won't always return the NodeManager instance.

          It was added only for reboot testability purpose. Now I have refactored the code.

            • Please look into the new findbugs and core test failures.

          These test failures are not related to this patch.

          • Minor:
            • NodeManager.main() / NodeManager.reboot():
              • Rather than copy the same code from main() into the new reboot() method, I would create a common method (call it commonNMMain, for example). You will probably need a parameter flag to tell you whether or not this is called from main() or from reboot().

          I have refactored the code accordingly.

            • NodeManager.reboot():
              • You should handle the case where nodeManagerShutdownHook is null. Even though it should not be null, if it is null, you should create a new instance of it.

          There will not be a case where it will be null. This condition is added only for test purpose.

            • NodeManager.reboot():
              • Rather than calling getConfig() when calling nodeManager.init(), I would argue that it would be better to create a new instance of conf so that any updates to the NM configs will be picked up. I guess it could be argued the other way, too. That is, you could make the case that the grid owner doesn't want the NM configs to change just because the RM was restarted. However, I think it should be a new NM config because the RM will have a new config.

          I don't feel it is a good idea to reload the configuration when the NM gets rebooted. If the user want to update any property then they should restart the NM manually.

            • NodeStatusUpdaterImpl:
              • YarnRPC rpc; is declared but never used.

          It was my mistake, I was doing some thing and forgot to remove. I have updated now.

            • TestNodeStatusUpdater:
              • I don't see where it is testing the RM-restart / NM-reboot case.

          This test case verifies that when NM gets reboot signal, current instance will be stopped and new instance will be started successfully.

          Show
          Devaraj K added a comment - Major: NodeManager.stateChanged() : In the case of STATE.STOPPED both stop() and reboot() need to be called. I don't see anywhere in the call hierarchy where NodeManager.stop() is ever called in the reboot case. Following the logic through, it looks like this is the call sequence from your patch: NodeStatusUpdaterImpl.startStatusUpdater().run() calls: NodeStatusUpdaterImpl.reboot() calls: AbstractService.stop() calls: AbstractService.changeState() calls: NodeManager.stateChanged() Which does not call NodeManager.stop() if NodeStatusUpdaterImpl.isRebooted is set. stop() was calling from reboot method. Anyway now I have moved it to common for SHUTDOWN and REBOOT cases. NodeManager : In order for the new private static variable nodeManager to be fully effective, you should take out the local NodeManager declaration within main() . Otherwise, the new getNodeManager() method won't always return the NodeManager instance. It was added only for reboot testability purpose. Now I have refactored the code. Please look into the new findbugs and core test failures. These test failures are not related to this patch. Minor: NodeManager.main() / NodeManager.reboot() : Rather than copy the same code from main() into the new reboot() method, I would create a common method (call it commonNMMain , for example). You will probably need a parameter flag to tell you whether or not this is called from main() or from reboot() . I have refactored the code accordingly. NodeManager.reboot() : You should handle the case where nodeManagerShutdownHook is null . Even though it should not be null , if it is null , you should create a new instance of it. There will not be a case where it will be null. This condition is added only for test purpose. NodeManager.reboot() : Rather than calling getConfig() when calling nodeManager.init() , I would argue that it would be better to create a new instance of conf so that any updates to the NM configs will be picked up. I guess it could be argued the other way, too. That is, you could make the case that the grid owner doesn't want the NM configs to change just because the RM was restarted. However, I think it should be a new NM config because the RM will have a new config. I don't feel it is a good idea to reload the configuration when the NM gets rebooted. If the user want to update any property then they should restart the NM manually. NodeStatusUpdaterImpl : YarnRPC rpc; is declared but never used. It was my mistake, I was doing some thing and forgot to remove. I have updated now. TestNodeStatusUpdater : I don't see where it is testing the RM-restart / NM-reboot case. This test case verifies that when NM gets reboot signal, current instance will be stopped and new instance will be started successfully.
          Hide
          Devaraj K added a comment -

          Thanks Eric for reviewing the patch.

          Cancelling the patch to address Eric review comments.

          Show
          Devaraj K added a comment - Thanks Eric for reviewing the patch. Cancelling the patch to address Eric review comments.
          Hide
          Eric Payne added a comment -

          Marking this as critical, since we want to alleviate as much SE pain points as possible, and this seems to be something that can be done fairly safely.

          Show
          Eric Payne added a comment - Marking this as critical, since we want to alleviate as much SE pain points as possible, and this seems to be something that can be done fairly safely.
          Hide
          Eric Payne added a comment -

          Hi Devaraj,

          Thanks for posting the patch. Here are my review comments:

          • Major:
            • NodeManager.stateChanged():
              • In the case of STATE.STOPPED both stop() and reboot() need to be called.
              • I don't see anywhere in the call hierarchy where NodeManager.stop() is ever called in the reboot case.
              • Following the logic through, it looks like this is the call sequence from your patch:
                • NodeStatusUpdaterImpl.startStatusUpdater().run()
                  • calls: NodeStatusUpdaterImpl.reboot()
                  • calls: AbstractService.stop()
                  • calls: AbstractService.changeState()
                  • calls: NodeManager.stateChanged()
                  • Which does not call NodeManager.stop() if NodeStatusUpdaterImpl.isRebooted is set.
            • NodeManager:
              • In order for the new private static variable nodeManager to be fully effective, you should take out the local NodeManager declaration within main(). Otherwise, the new getNodeManager() method won't always return the NodeManager instance.
            • Please look into the new findbugs and core test failures.
          • Minor:
            • NodeManager.main() / NodeManager.reboot():
              • Rather than copy the same code from main() into the new reboot() method, I would create a common method (call it commonNMMain, for example). You will probably need a parameter flag to tell you whether or not this is called from main() or from reboot().
            • NodeManager.reboot():
              • You should handle the case where nodeManagerShutdownHook is null. Even though it should not be null, if it is null, you should create a new instance of it.
            • NodeManager.reboot():
              • Rather than calling getConfig() when calling nodeManager.init(), I would argue that it would be better to create a new instance of conf so that any updates to the NM configs will be picked up. I guess it could be argued the other way, too. That is, you could make the case that the grid owner doesn't want the NM configs to change just because the RM was restarted. However, I think it should be a new NM config because the RM will have a new config.
            • NodeStatusUpdaterImpl:
              • YarnRPC rpc; is declared but never used.
            • TestNodeStatusUpdater:
              • I don't see where it is testing the RM-restart / NM-reboot case.
          Show
          Eric Payne added a comment - Hi Devaraj, Thanks for posting the patch. Here are my review comments: Major: NodeManager.stateChanged() : In the case of STATE.STOPPED both stop() and reboot() need to be called. I don't see anywhere in the call hierarchy where NodeManager.stop() is ever called in the reboot case. Following the logic through, it looks like this is the call sequence from your patch: NodeStatusUpdaterImpl.startStatusUpdater().run() calls: NodeStatusUpdaterImpl.reboot() calls: AbstractService.stop() calls: AbstractService.changeState() calls: NodeManager.stateChanged() Which does not call NodeManager.stop() if NodeStatusUpdaterImpl.isRebooted is set. NodeManager : In order for the new private static variable nodeManager to be fully effective, you should take out the local NodeManager declaration within main() . Otherwise, the new getNodeManager() method won't always return the NodeManager instance. Please look into the new findbugs and core test failures. Minor: NodeManager.main() / NodeManager.reboot() : Rather than copy the same code from main() into the new reboot() method, I would create a common method (call it commonNMMain , for example). You will probably need a parameter flag to tell you whether or not this is called from main() or from reboot() . NodeManager.reboot() : You should handle the case where nodeManagerShutdownHook is null . Even though it should not be null , if it is null , you should create a new instance of it. NodeManager.reboot() : Rather than calling getConfig() when calling nodeManager.init() , I would argue that it would be better to create a new instance of conf so that any updates to the NM configs will be picked up. I guess it could be argued the other way, too. That is, you could make the case that the grid owner doesn't want the NM configs to change just because the RM was restarted. However, I think it should be a new NM config because the RM will have a new config. NodeStatusUpdaterImpl : YarnRPC rpc; is declared but never used. TestNodeStatusUpdater : I don't see where it is testing the RM-restart / NM-reboot case.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12512311/MAPREDUCE-3034.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.mapred.TestJobClientGetJob
          org.apache.hadoop.mapred.TestMRWithDistributedCache
          org.apache.hadoop.mapred.TestLocalModeWithNewApis

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1704//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1704//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1704//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12512311/MAPREDUCE-3034.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.mapred.TestJobClientGetJob org.apache.hadoop.mapred.TestMRWithDistributedCache org.apache.hadoop.mapred.TestLocalModeWithNewApis +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1704//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1704//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1704//console This message is automatically generated.
          Hide
          Devaraj K added a comment -

          I have updated the first level patch for review. Please review and give your comments.

          Show
          Devaraj K added a comment - I have updated the first level patch for review. Please review and give your comments.
          Hide
          Eric Payne added a comment -

          Devaraj,

          Go ahead and upload your patch and I will be glad to review it.

          Thanks,
          -Eric

          Show
          Eric Payne added a comment - Devaraj, Go ahead and upload your patch and I will be glad to review it. Thanks, -Eric
          Hide
          Devaraj K added a comment -

          Thanks Eric.

          I am also doing same as you described. When NM gets reboot command it stops all the services as SHUTDOWN command does, then initialize and start all the servies as NM startup does.

          I will post the patch some time today, you can have a review on it. If you have better patch/ready to post, please feel free to post it.

          Thanks
          Devaraj

          Show
          Devaraj K added a comment - Thanks Eric. I am also doing same as you described. When NM gets reboot command it stops all the services as SHUTDOWN command does, then initialize and start all the servies as NM startup does. I will post the patch some time today, you can have a review on it. If you have better patch/ready to post, please feel free to post it. Thanks Devaraj
          Hide
          Eric Payne added a comment -

          @Devaraj

          That's fine if you want to take it over. When do you think you can get a patch up? I was hoping to get this going within the next week.

          From my point of view, the basic requirement is to be able to bounce the RM without having to manually star every single NM in a very large cluster (thousands of NMs).

          Right now, when NM gets the reboot command from the RM, it just calls the stop hooks, just like if it gets a shutdown command. My plan is that if NM gets reboot command, it still executes the shutdown hook, but then add a reboot hook that executes the same basic code as was done to begin with in NameNode.main(). Is that your basic plan?

          I have already written up a "proof-of-concept" patch and tested it in a 10-node secure cluster. To test it, I shutdown RM and restarted it. After the restart, I ran an hour's worth of jobs and compared the time and heap size from before and after. They all looked good to me.

          Thanks,
          -Eric

          Show
          Eric Payne added a comment - @Devaraj That's fine if you want to take it over. When do you think you can get a patch up? I was hoping to get this going within the next week. From my point of view, the basic requirement is to be able to bounce the RM without having to manually star every single NM in a very large cluster (thousands of NMs). Right now, when NM gets the reboot command from the RM, it just calls the stop hooks, just like if it gets a shutdown command. My plan is that if NM gets reboot command, it still executes the shutdown hook, but then add a reboot hook that executes the same basic code as was done to begin with in NameNode.main(). Is that your basic plan? I have already written up a "proof-of-concept" patch and tested it in a 10-node secure cluster. To test it, I shutdown RM and restarted it. After the restart, I ran an hour's worth of jobs and compared the time and heap size from before and after. They all looked good to me. Thanks, -Eric
          Hide
          Devaraj K added a comment -

          Eric,

          If you don't mind I will take it up because already I have done some work on this issue.

          Show
          Devaraj K added a comment - Eric, If you don't mind I will take it up because already I have done some work on this issue.
          Hide
          Jonathan Eagles added a comment -

          MAPREDUCE-3272 has some good steps on manually testing this issue. Thanks, Ramya.

          Show
          Jonathan Eagles added a comment - MAPREDUCE-3272 has some good steps on manually testing this issue. Thanks, Ramya.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Downgrading the blocker status and unmarking the Fix-version.

          Show
          Vinod Kumar Vavilapalli added a comment - Downgrading the blocker status and unmarking the Fix-version.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Sounds good. I would rather see it done correctly all the way, instead of just shutting down the node manager.

          I agree. Let me reopen this and downgrade the importance.

          Show
          Vinod Kumar Vavilapalli added a comment - Sounds good. I would rather see it done correctly all the way, instead of just shutting down the node manager. I agree. Let me reopen this and downgrade the importance.
          Hide
          Robert Joseph Evans added a comment -

          Sounds good. I would rather see it done correctly all the way, instead of just shutting down the node manager.

          Show
          Robert Joseph Evans added a comment - Sounds good. I would rather see it done correctly all the way, instead of just shutting down the node manager.
          Hide
          Arun C Murthy added a comment -

          Will be fixed via MAPREDUCE-2775.

          Show
          Arun C Murthy added a comment - Will be fixed via MAPREDUCE-2775 .
          Hide
          Arun C Murthy added a comment -

          Bobby, thanks for taking this up.

          However, Devaraj K is nearly there with MAPREDUCE-2775 and handling REBOOT alongwith DECOMMISSION is a trivial one-line extension there. That patch is much more involved and hence, if you don't mind I'm going to mark this as a dup of MAPREDUCE-2775 and ask Devaraj to incorporate this. Thanks!

          Show
          Arun C Murthy added a comment - Bobby, thanks for taking this up. However, Devaraj K is nearly there with MAPREDUCE-2775 and handling REBOOT alongwith DECOMMISSION is a trivial one-line extension there. That patch is much more involved and hence, if you don't mind I'm going to mark this as a dup of MAPREDUCE-2775 and ask Devaraj to incorporate this. Thanks!
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12499951/MR-3034.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 14 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 160 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1092//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1092//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-app.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1092//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1092//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-common.html
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1092//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12499951/MR-3034.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 14 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 160 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1092//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1092//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-app.html Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1092//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1092//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-common.html Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1092//console This message is automatically generated.
          Hide
          Robert Joseph Evans added a comment -

          This is a first go at the patch. I would like some feedback on the direction of the patch.

          Right now it is only doing a shutdown of the node manager. I need to do some more manual testing of this before I would consider it ready to be merged in, but I thought it would be good to get feedback sooner rather then later.

          Show
          Robert Joseph Evans added a comment - This is a first go at the patch. I would like some feedback on the direction of the patch. Right now it is only doing a shutdown of the node manager. I need to do some more manual testing of this before I would consider it ready to be merged in, but I thought it would be good to get feedback sooner rather then later.
          Hide
          Robert Joseph Evans added a comment -

          @Arun Assuming MR-2858 I shoudl have some cycles to work on this if that is OK with you.

          Show
          Robert Joseph Evans added a comment - @Arun Assuming MR-2858 I shoudl have some cycles to work on this if that is OK with you.
          Hide
          Arun C Murthy added a comment -

          Exactly what I was thinking...

          Show
          Arun C Murthy added a comment - Exactly what I was thinking...
          Hide
          Mahadev konar added a comment -

          For this, can we just do shutdown for 0.23 and lets admins restart the NodeManagers? Cleaning up the data structures to do a clean restart might be a little hard to do. We could do the restart in coming up releases. Thoughts?

          Show
          Mahadev konar added a comment - For this, can we just do shutdown for 0.23 and lets admins restart the NodeManagers? Cleaning up the data structures to do a clean restart might be a little hard to do. We could do the restart in coming up releases. Thoughts?
          Hide
          Amol Kekre added a comment -

          Vinod, any eta on it?

          Show
          Amol Kekre added a comment - Vinod, any eta on it?

            People

            • Assignee:
              Devaraj K
              Reporter:
              Vinod Kumar Vavilapalli
            • Votes:
              0 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development