Hadoop Map/Reduce / MAPREDUCE-3460

MR AM can hang if containers are allocated on a node blacklisted by the AM

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.23.0, 0.24.0
    • Fix Version/s: 0.23.1
    • Component/s: mr-am, mrv2
    • Labels: None

      Description

      When an AM is assigned a FAILED_MAP (priority = 5) container on a NodeManager that it has blacklisted, it tries to find a corresponding container request.
      The lookup uses only the hostname, so it can return any of the ContainerRequests that may have requested a container on this node, not necessarily the one at priority 5. That container request is cleaned to remove the bad node and then added back to the RM 'ask' list.
      The AM clears the 'ask' list after each heartbeat. The RM Allocator is still aware of the priority = 5 container (in 'remoteRequestsTable'), but that request never gets added back to the 'ask' set, which is what is sent to the RM. The AM therefore waits for a container that the RM no longer knows it should allocate.
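The bookkeeping mismatch described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: the class `AskListSketch` and its methods are hypothetical and do not mirror the real `RMContainerAllocator` or `RMContainerRequestor` code, requests are tracked per priority only, and the value 20 for an ordinary map priority is an assumption. Only the names `ask` and `remoteRequestsTable` and the priority 5 come from the description.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical model of the AM/RM request bookkeeping; not the real Hadoop classes. */
class AskListSketch {
    static final int FAILED_MAP = 5;   // priority quoted in the description
    static final int MAP = 20;         // assumed priority for an ordinary map request

    // AM side: containers the AM believes are still outstanding, per priority.
    final Map<Integer, Integer> remoteRequestsTable = new HashMap<>();
    // AM side: priorities whose counts go out on the next heartbeat.
    final Set<Integer> ask = new TreeSet<>();
    // RM side: containers the RM believes it still owes, per priority.
    final Map<Integer, Integer> rmView = new HashMap<>();

    void addRequest(int priority) {
        remoteRequestsTable.merge(priority, 1, Integer::sum);
        ask.add(priority);
    }

    /** Send the ask set to the RM, then clear it, as the description says the AM does. */
    void heartbeat() {
        for (int p : ask) {
            rmView.put(p, remoteRequestsTable.get(p));
        }
        ask.clear();
    }

    /** The RM hands out one container at the given priority and decrements its own count. */
    void rmAllocates(int priority) {
        rmView.merge(priority, -1, Integer::sum);
    }

    /**
     * Buggy rejection path: the container came from a blacklisted host, and the request to
     * re-ask is picked by hostname alone, so it can have a different priority than the
     * container that was actually allocated.
     */
    void rejectAndReAsk(int priorityFoundByHostname) {
        ask.add(priorityFoundByHostname);
    }

    static AskListSketch runScenario() {
        AskListSketch s = new AskListSketch();
        s.addRequest(FAILED_MAP);
        s.addRequest(MAP);
        s.heartbeat();              // RM now knows about both requests
        s.rmAllocates(FAILED_MAP);  // container handed out on a blacklisted node
        s.rejectAndReAsk(MAP);      // hostname lookup returned the MAP request instead
        s.heartbeat();              // only the MAP count is re-sent
        return s;
    }

    public static void main(String[] args) {
        AskListSketch s = runScenario();
        System.out.println("AM outstanding at priority 5: " + s.remoteRequestsTable.get(FAILED_MAP));
        System.out.println("RM outstanding at priority 5: " + s.rmView.get(FAILED_MAP));
        System.out.println("Pending ask entries: " + s.ask);
    }
}
```

In this model the AM ends with one outstanding priority-5 request, the RM believes it owes none, and the ask set is empty, so nothing will ever correct the RM's view. Re-asking at the priority of the rejected container, rather than at the priority of whichever request the hostname lookup returns, would keep the two sides consistent in this simplified model.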

      1. MR-3460.txt
        7 kB
        Robert Joseph Evans
      2. MR-3460.txt
        9 kB
        Robert Joseph Evans
      3. MR3460_v3.txt
        13 kB
        Siddharth Seth
      4. MR3460_v4.txt
        13 kB
        Robert Joseph Evans

        Activity

        Brian Cho added a comment -

        Was a JIRA ever filed for using hostname:port instead of only hostname in FifoScheduler?

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #916 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/916/)
        MAPREDUCE-3460. MR AM can hang if containers are allocated on a node blacklisted by the AM. (Contributed by Hitesh Shah and Robert Joseph Evans)

        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209737
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-0.23-Build #114 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/114/)
        merge MAPREDUCE-3460 from trunk

        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209740
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #883 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/883/)
        MAPREDUCE-3460. MR AM can hang if containers are allocated on a node blacklisted by the AM. (Contributed by Hitesh Shah and Robert Joseph Evans)

        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209737
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Build #96 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/96/)
        merge MAPREDUCE-3460 from trunk

        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209740
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-HAbranch-build #4 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/4/)
        merge MAPREDUCE-3460 from trunk

        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209740
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Commit #241 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/241/)
        merge MAPREDUCE-3460 from trunk

        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209740
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-0.23-Commit #257 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/257/)
        merge MAPREDUCE-3460 from trunk

        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209740
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #1380 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1380/)
        MAPREDUCE-3460. MR AM can hang if containers are allocated on a node blacklisted by the AM. (Contributed by Hitesh Shah and Robert Joseph Evans)

        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209737
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Common-0.23-Commit #246 (See https://builds.apache.org/job/Hadoop-Common-0.23-Commit/246/)
        merge MAPREDUCE-3460 from trunk

        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209740
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #1355 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1355/)
        MAPREDUCE-3460. MR AM can hang if containers are allocated on a node blacklisted by the AM. (Contributed by Hitesh Shah and Robert Joseph Evans)

        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209737
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #1429 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1429/)
        MAPREDUCE-3460. MR AM can hang if containers are allocated on a node blacklisted by the AM. (Contributed by Hitesh Shah and Robert Joseph Evans)

        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209737
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java
        Siddharth Seth added a comment -

        Committed to trunk and branch-0.23. Thanks Hitesh and Bobby.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12505898/MR3460_v4.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 12 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1384//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1384//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-examples.html
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1384//console

        This message is automatically generated.

        Robert Joseph Evans added a comment -

        Oh, and I will be filing a JIRA for the FifoScheduler issue.

        Robert Joseph Evans added a comment -

        Yes, Sid, it did reproduce the issue. Thanks for doing that. I am just uploading a new patch that fixes some spelling mistakes I introduced.

        Robert Joseph Evans added a comment -

        I don't know for sure yet whether the test simulates the situation, but yesterday, before I left, one of the tests we were running got into this state and I was able to poke around a little. I have the complete set of AM and RM logs for that period, and I am walking through them now to try to understand exactly what happened and to reproduce it.

        From what I have seen so far, the following is the sequence of events:

        2011-12-01 19:05:48,480 ASSIGNED CONTAINER container_1322524316055_0237_01_000002 TO HOST H2
        2011-12-01 19:05:48,483 ASSIGNED CONTAINER container_1322524316055_0237_01_000003 TO HOST H2
        2011-12-01 19:05:50,469 ASSIGNED CONTAINER container_1322524316055_0237_01_000002 TO ATTEMPT attempt_1322524316055_0237_m_000000_0
        2011-12-01 19:05:50,476 ASSIGNED CONTAINER container_1322524316055_0237_01_000003 TO ATTEMPT attempt_1322524316055_0237_m_000001_0
        2011-12-01 19:06:11,541 ASSIGNED CONTAINER container_1322524316055_0237_01_000004 TO HOST H2
        2011-12-01 19:06:11,542 ASSIGNED CONTAINER container_1322524316055_0237_01_000005 TO HOST H2
        2011-12-01 19:06:12,539 ATTEMPT attempt_1322524316055_0237_m_000000_0 FAILED
        2011-12-01 19:06:12,540 ATTEMPT attempt_1322524316055_0237_m_000001_0 FAILED
        2011-12-01 19:06:12,545 ASSIGNED CONTAINER container_1322524316055_0237_01_000004 TO ATTEMPT attempt_1322524316055_0237_m_000002_0
        2011-12-01 19:06:12,555 ASSIGNED CONTAINER container_1322524316055_0237_01_000005 TO ATTEMPT attempt_1322524316055_0237_m_000003_0
        2011-12-01 19:06:12,573 1 FAILURES ON H2
        2011-12-01 19:06:12,574 2 FAILURES ON H2
        2011-12-01 19:06:20,573 ASSIGNED CONTAINER container_1322524316055_0237_01_000006 TO HOST H2
        2011-12-01 19:06:20,574 ASSIGNED CONTAINER container_1322524316055_0237_01_000007 TO HOST H2
        2011-12-01 19:06:20,585 ATTEMPT attempt_1322524316055_0237_m_000002_0 FAILED
        2011-12-01 19:06:20,586 ATTEMPT attempt_1322524316055_0237_m_000003_0 FAILED
        2011-12-01 19:06:20,589 ASSIGNED CONTAINER container_1322524316055_0237_01_000006 TO ATTEMPT attempt_1322524316055_0237_m_000001_1
        2011-12-01 19:06:20,592 ASSIGNED CONTAINER container_1322524316055_0237_01_000007 TO ATTEMPT attempt_1322524316055_0237_m_000000_1
        2011-12-01 19:06:20,605 3 FAILURES ON H2
        2011-12-01 19:06:20,607 4 FAILURES ON H2
        2011-12-01 19:06:20,608 BLACKLISTED H2
        2011-12-01 19:06:23,998 ASSIGNED CONTAINER container_1322524316055_0237_01_000008 TO HOST H2
        2011-12-01 19:06:23,999 ASSIGNED CONTAINER container_1322524316055_0237_01_000009 TO HOST H2
        2011-12-01 19:06:26,647 ASSIGNED CONTAINER container_1322524316055_0237_01_000010 TO HOST H1
        2011-12-01 19:06:26,649 ASSIGNED CONTAINER container_1322524316055_0237_01_000011 TO HOST H1
        2011-12-01 19:06:28,635 ASSIGNED CONTAINER container_1322524316055_0237_01_000010 TO ATTEMPT attempt_1322524316055_0237_m_000004_0
        2011-12-01 19:06:28,640 ASSIGNED CONTAINER container_1322524316055_0237_01_000011 TO ATTEMPT attempt_1322524316055_0237_m_000005_0
        2011-12-01 19:06:40,839 ASSIGNED CONTAINER container_1322524316055_0237_01_000012 TO HOST H1
        2011-12-01 19:06:40,840 ASSIGNED CONTAINER container_1322524316055_0237_01_000013 TO HOST H1
        2011-12-01 19:06:42,675 ASSIGNED CONTAINER container_1322524316055_0237_01_000012 TO ATTEMPT attempt_1322524316055_0237_m_000006_0
        2011-12-01 19:06:42,682 ASSIGNED CONTAINER container_1322524316055_0237_01_000013 TO ATTEMPT attempt_1322524316055_0237_m_000007_0
        2011-12-01 19:06:45,698 ASSIGNED CONTAINER container_1322524316055_0237_01_000014 TO HOST H1
        2011-12-01 19:06:45,699 ASSIGNED CONTAINER container_1322524316055_0237_01_000015 TO HOST H1
        2011-12-01 19:06:46,698 ASSIGNED CONTAINER container_1322524316055_0237_01_000014 TO ATTEMPT attempt_1322524316055_0237_m_000008_0
        2011-12-01 19:06:46,703 ASSIGNED CONTAINER container_1322524316055_0237_01_000015 TO ATTEMPT attempt_1322524316055_0237_m_000009_0
        

        After that it looks like the scheduler has several requested containers to assign, but it never assigns any of them, and the AM never asks for anything new.
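
The failure counting visible in the log above (four failures on H2, then H2 is blacklisted) could be sketched roughly as follows. This is only an illustration: the class name, the method names, and the idea of passing the threshold in as a constructor argument are assumptions, and the value 4 is read off this log, not taken from any confirmed configuration default.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of per-node failure counting; not the MR AM code. */
final class NodeFailureTrackerSketch {
    private final int maxFailuresPerNode; // assumed threshold, supplied by the caller
    private final Map<String, Integer> failures = new HashMap<>();
    private final Set<String> blacklisted = new HashSet<>();

    NodeFailureTrackerSketch(int maxFailuresPerNode) {
        this.maxFailuresPerNode = maxFailuresPerNode;
    }

    /** Records one failed attempt on the host; returns true once the host is blacklisted. */
    boolean recordFailure(String host) {
        int count = failures.merge(host, 1, Integer::sum);
        if (count >= maxFailuresPerNode) {
            blacklisted.add(host);
        }
        return blacklisted.contains(host);
    }

    boolean isBlacklisted(String host) {
        return blacklisted.contains(host);
    }
}
```

Note that a blacklist like this lives only in the AM. As the log shows with containers 000008 and 000009, the RM can still hand back containers on H2 after that point, which is the case this issue is about.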

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12505827/MR3460_v3.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 12 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1383//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1383//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-examples.html
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1383//console

        This message is automatically generated.

        Siddharth Seth added a comment -

        When I run the unit test above I see that the hosts (NMs) are registered with the RM using "host:port", but when we request a container in the tests it only has "host" in it. The scheduler seems to indicate that when it assigns a container to a host it is because it is rack local, not data local. As part of this, the host-specific request does not seem to be cleared out from the scheduler even though it is not part of the new ask. If I switch it over to requesting a container on a particular "host:port" then the scheduler will find the container to be data local, and clear out the host, rack, and * requests. This seems to work OK, but I thought when we requested a container due to data locality we used just the host name, because that is what HDFS returns to us.

        Good catch! Like you said, the request shouldn't care about the port for data locality. The FifoScheduler seems to be using the entire nodeAddress for allocating containers - which is incorrect. The capacity scheduler appears to be working as it should though - using only the hostname to allocate containers.
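
A minimal sketch of the mismatch described above, assuming node addresses have the form "host:port" and requests carry only "host". The class and method names are hypothetical and this is not the FifoScheduler code; it only shows why comparing against the full node address misses a host-only request, while comparing on the hostname finds it. It also assumes hostnames contain no colon.

```java
import java.util.Set;

/** Hypothetical illustration of host vs. host:port matching; not scheduler code. */
final class HostMatchSketch {
    /** Strips an optional ":port" suffix from a node address. */
    static String hostOf(String nodeAddress) {
        int colon = nodeAddress.lastIndexOf(':');
        return colon < 0 ? nodeAddress : nodeAddress.substring(0, colon);
    }

    /** Compares the full node address; misses requests that name only the host. */
    static boolean matchesByAddress(Set<String> requestedHosts, String nodeAddress) {
        return requestedHosts.contains(nodeAddress);
    }

    /** Compares on the hostname only, ignoring the port. */
    static boolean matchesByHost(Set<String> requestedHosts, String nodeAddress) {
        return requestedHosts.contains(hostOf(nodeAddress));
    }
}
```

With a request for "h1" and a node registered as "h1:1234", `matchesByAddress` reports no match and `matchesByHost` reports a match.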

        Siddharth Seth added a comment -

        Bobby, could you please see if this test simulates the situation? Meanwhile, I am looking at your last comment about locality.

        Robert Joseph Evans added a comment -

        Sid, I think I may have found a bug in the scheduler/MR-AM, but I am not really sure about it, and I would like your feedback on it.

        When I run the unit test above I see that the hosts (NMs) are registered with the RM using "host:port", but when we request a container in the tests it only has "host" in it. The scheduler seems to indicate that when it assigns a container to a host it is because it is rack local, not data local. As part of this, the host-specific request does not seem to be cleared out from the scheduler even though it is not part of the new ask. If I switch it over to requesting a container on a particular "host:port" then the scheduler will find the container to be data local, and clear out the host, rack, and * requests. This seems to work OK, but I thought when we requested a container due to data locality we used just the host name, because that is what HDFS returns to us.

        Robert Joseph Evans added a comment -

        OK, I understand why they all keep going to h1: there is no way to request anything but h1, so it requests with a *. When h1 heartbeats back in and has free space, it still gets a container assigned to it. Even in this situation, and without the patch, I don't see any evidence of requests being lost.

        Robert Joseph Evans added a comment -

        Sid, that didn't do it. It fails both with and without the patch. For some reason it looks like after steps 11 and 12 all AM heartbeats still have the containers scheduled on h1 (even though it is blacklisted). I am investigating it.

        Siddharth Seth added a comment -

        New container requests ignore node blacklisting and make an entry into mapsHostMapping. That would be one way to recreate this issue (or, alternatively, to fix it).

        Something like
        1. request _1 on h1
        2. am heartbeat
        3. h1 heartbeat
        4. am heartbeat - container assigned
        5. fail _1 on h1
        6. request fast_fail replacement for _1
        7. am heartbeat - to update request
        8. request _3 on h3 / h1,h3
        9. h1 heartbeat - to schedule (RM only aware of fast_fail _1 at this point)
        10. am heartbeat - to get a fast_fail allocated on a blacklisted node.
        11. h1 heartbeat
        12. h3 heartbeat

        Robert Joseph Evans added a comment -

        I think I must be doing it wrong somehow, or I don't understand the order of things you are requesting. I am doing the following and it passes both with and without the patch:

        1. request _1 on h1
        2. am heartbeat()
        3. h1 heartbeat()
        4. am heartbeat() //Get _1 container back
        5. fail _1 so h1 is blacklisted
        6. request _3 on h3
        7. request fast fail map _2 on h1
          ... (More heartbeats to schedule things)

        This does not work to reproduce the issue because any requests for h1 added after h1 is blacklisted will have h1 removed.

        If I move the fast fail map request above h1 being blacklisted, then when the container request comes back for h1 it sees that it is blacklisted. It will not find the request in the mapsHostMapping and will resort to pulling a request out of maps, which still works. The only way we are going to get this deadlock is if somehow maps is empty. I don't really see how the patch changes that. I really don't understand all of what the code is doing, so I could just be completely wrong about it.
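
The behaviour noted above, that a request added after h1 is blacklisted has h1 removed, can be pictured with a small sketch. The names here are hypothetical and this is not the allocator's code; it only shows that a request naming h1 and h3 keeps h3, while a request naming only h1 ends up with no hosts at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of dropping blacklisted hosts from a new request. */
final class BlacklistFilterSketch {
    /** Returns the requested hosts that are not blacklisted, in their original order. */
    static List<String> usableHosts(List<String> requestedHosts, Set<String> blacklisted) {
        List<String> usable = new ArrayList<>();
        for (String host : requestedHosts) {
            if (!blacklisted.contains(host)) {
                usable.add(host);
            }
        }
        return usable;
    }
}
```

The empty-hosts case is the interesting one for this issue, since such a request can then only be satisfied through the rack or * entries.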

        Robert Joseph Evans added a comment -

        Addressing Sid's comments.

        Siddharth Seth added a comment -

        Bobby, the test is still failing with and without the change. I think the failed container needs to be sent after the first container is allocated - and the second container request + failed map request after this.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12505632/MR-3460.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 12 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1369//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1369//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-examples.html
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1369//console

        This message is automatically generated.

        Robert Joseph Evans added a comment -

        Sid, you were correct. It was not accessing the expected code. I was confused because the FAST_FAIL_MAP container was still being assigned. It was just not sent to the scheduler before the node was blacklisted.

        I have updated the test, and also the code itself. The original patch was updating the list of failed maps and also the list of pending maps, but this caused the actual allocation of the container to fail later on.

        Mahadev konar added a comment -

        Cancelling the patch to address the issues Sid pointed out.

        Siddharth Seth added a comment -

        Thanks for adding the unit test Bobby. The test passes with and without the change to the RMContainerAllocator. The MockRM needs to allocate a prio=5 container on h1 to reproduce the issue (and the MRAM needs to send back a release for this container). The AM was losing track of allocated containers with priority=5, hosts=empty, Host blacklisted by AM.
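
The way a request can be lost, as outlined in the issue description, can be sketched as below. This is a simplified model under stated assumptions: the field names `ask` and `remoteRequestsTable` come from the description, but their types, the string request keys, and the methods are hypothetical. It shows only that a request which stays in the table but is never re-added to the ask set is not sent to the RM again.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the 'ask' set versus 'remoteRequestsTable'. */
final class AskListSketch {
    // Everything the AM still wants, keyed by a made-up request key.
    final Map<String, Integer> remoteRequestsTable = new HashMap<>();
    // Only what changed since the last heartbeat; this is what gets sent.
    final Set<String> ask = new LinkedHashSet<>();

    void addRequest(String key) {
        remoteRequestsTable.merge(key, 1, Integer::sum);
        ask.add(key);
    }

    /** Returns what is sent on this heartbeat and clears the ask set afterwards. */
    Set<String> heartbeat() {
        Set<String> sent = new LinkedHashSet<>(ask);
        ask.clear();
        return sent;
    }
}
```

After one heartbeat the key is still present in `remoteRequestsTable`, but a second heartbeat sends nothing unless something adds the key to `ask` again.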

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12505513/MR-3460.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 12 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1362//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1362//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-examples.html
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1362//console

        This message is automatically generated.

        Robert Joseph Evans added a comment -

        Adding a patch with the fix by Hitesh and a unit test to verify that it works.

        Mahadev konar added a comment -

        Great.

        Thanks Hitesh.

        Bobby, can you try it out and see if you can add a test case?

        As for the long-term goal of cleaning up the if/then/else, we'll have to give it some thought before we go there. Hopefully 0.23 will be stable soon.

        Robert Joseph Evans added a comment -

        +1 to Hitesh's patch at least as a quick fix. I can try and reproduce the issue here and verify that the patch does indeed fix the issue. I can also add in a few unit tests for it and turn it into a real patch if you like.

        I would also like some feedback on a potential (long-term) refactor of the code, which would be done on a separate JIRA after 0.23 stabilizes. It seems to me that the root cause of this issue is that a special condition for a FAST_FAIL_MAP was missed. The code right now is written with lots of if/else statements separating out map tasks from reduce tasks and also from failed map tasks, etc. I think it would be cleaner to replace the if statements with classes that use polymorphism to change the methods called. This would make the different handling of a failed map, a normal map, and a reduce more evident. It would also force the internal data structures that keep track of the different types of tasks to be combined together. This is just something that popped into my head while trying to evaluate Hitesh's fix. I have not really evaluated what it would take to make it work; I would just like some feedback about the idea before filing a JIRA about it.
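
To make the refactoring idea above concrete, here is a rough sketch of what replacing the if/else chain with polymorphism might look like. Everything in it is hypothetical: the type names, the use of plain strings for attempt IDs, and in particular the selection policies, which are placeholders and do not reflect what the allocator actually does for each kind of task.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical task kinds; the names are illustrative only. */
enum TaskKind { MAP, FAILED_MAP, REDUCE }

/** One handler per kind owns that kind's pending attempts. */
abstract class PendingRequests {
    protected final Deque<String> pending = new ArrayDeque<>();

    void add(String attemptId) {
        pending.addLast(attemptId);
    }

    /** Picks the attempt whose request should be re-issued, or null if none is pending. */
    abstract String pickReplacement();
}

/** Placeholder policy: oldest failed map first. */
final class FailedMapRequests extends PendingRequests {
    @Override
    String pickReplacement() {
        return pending.pollFirst();
    }
}

/** Placeholder policy: any pending attempt will do, so take the newest. */
final class OrdinaryRequests extends PendingRequests {
    @Override
    String pickReplacement() {
        return pending.pollLast();
    }
}

/** Dispatches on the task kind through a lookup instead of an if/else chain. */
final class AllocatorSketch {
    private final Map<TaskKind, PendingRequests> byKind = new EnumMap<>(TaskKind.class);

    AllocatorSketch() {
        byKind.put(TaskKind.MAP, new OrdinaryRequests());
        byKind.put(TaskKind.FAILED_MAP, new FailedMapRequests());
        byKind.put(TaskKind.REDUCE, new OrdinaryRequests());
    }

    void add(TaskKind kind, String attemptId) {
        byKind.get(kind).add(attemptId);
    }

    String pickReplacement(TaskKind kind) {
        return byKind.get(kind).pickReplacement();
    }
}
```

The point of the shape is that a new kind of task has to supply its own `pickReplacement`, so a special case such as the failed-map one is harder to overlook.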

        Hitesh Shah added a comment -

        Based on Sid's theory, the problem would be in RMContainerAllocator#getContainerReqToReplace.

        -      if (PRIORITY_FAST_FAIL_MAP.equals(priority) 
        -          || PRIORITY_MAP.equals(priority)) {
        +      if (PRIORITY_FAST_FAIL_MAP.equals(priority)) {
        +        while (toBeReplaced == null && earlierFailedMaps.size() > 0) {
        +          TaskAttemptId tId = earlierFailedMaps.removeFirst();
        +          if (maps.containsKey(tId)) {
        +            toBeReplaced = maps.remove(tId);
        +          }
        +        }
        +        return toBeReplaced;
        +      }
        +      else if (PRIORITY_MAP.equals(priority)) {
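
For readers who want to run the new FAST_FAIL_MAP branch in isolation, the following is a self-contained approximation of the added lines. It is a sketch, not the patch: `String` stands in for both `TaskAttemptId` and the container request type, the class and method names are made up, and only the field names `earlierFailedMaps` and `maps` are taken from the diff.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-alone version of the FAST_FAIL_MAP branch in the diff. */
final class FastFailReplacementSketch {
    // Stand-ins for the fields named in the diff; String replaces the real types.
    final Deque<String> earlierFailedMaps = new ArrayDeque<>();
    final Map<String, String> maps = new HashMap<>();

    /** Returns the request for the oldest failed map that is still pending, or null. */
    String replacementForFastFailMap() {
        String toBeReplaced = null;
        while (toBeReplaced == null && !earlierFailedMaps.isEmpty()) {
            String tId = earlierFailedMaps.removeFirst();
            // Skip failed attempts that no longer have a pending request.
            if (maps.containsKey(tId)) {
                toBeReplaced = maps.remove(tId);
            }
        }
        return toBeReplaced;
    }
}
```

The loop consumes entries from `earlierFailedMaps` until it finds one that is still in `maps`, so stale entries are discarded along the way.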
        

          People

          • Assignee:
            Robert Joseph Evans
            Reporter:
            Siddharth Seth
          • Votes:
            0
            Watchers:
            6

              Development