Uploaded image for project: 'Hadoop Map/Reduce'
  1. Hadoop Map/Reduce
  2. MAPREDUCE-3261

AM unable to release containers

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 0.23.0
    • applicationmaster
    • None

    Description

      I'm probably doing something wrong here, but I can't figure it out.

      My ApplicationMaster is sending an AllocateRequest with ContainerIds to release. My ResourceManager logs say:

      2011-10-25 10:02:52,236 WARN resourcemanager.RMAuditLogger (RMAuditLogger.java:logFailure(207)) - USER=criccomi IP=127.0.0.1 OPERATION=AM Released Container TARGET=FifoScheduler RESULT=FAILURE DESCRIPTION=Trying to release container not owned by app or with invalid id PERMISSIONS=Unauthorized access or invalid container APPID=application_1319485153554_0028 CONTAINERID=container_1319485153554_0028_01_000003

      The container ID is valid, as is the app id:

      [criccomi@criccomi-ld logs]$ pwd
      /tmp/logs
      [criccomi@criccomi-ld logs]$ find .
      .
      ./application_1319485153554_0028
      ./application_1319485153554_0028/container_1319485153554_0028_01_000002
      ./application_1319485153554_0028/container_1319485153554_0028_01_000002/stderr
      ./application_1319485153554_0028/container_1319485153554_0028_01_000002/stdout
      ./application_1319485153554_0028/container_1319485153554_0028_01_000001
      ./application_1319485153554_0028/container_1319485153554_0028_01_000001/stderr
      ./application_1319485153554_0028/container_1319485153554_0028_01_000001/stdout
      ./application_1319485153554_0028/container_1319485153554_0028_01_000003
      ./application_1319485153554_0028/container_1319485153554_0028_01_000003/stderr
      ./application_1319485153554_0028/container_1319485153554_0028_01_000003/stdout
      ./application_1319485153554_0028/container_1319485153554_0028_01_000006
      ./application_1319485153554_0028/container_1319485153554_0028_01_000006/stderr
      ./application_1319485153554_0028/container_1319485153554_0028_01_000006/stdout

      The containers are still running.

      My code to start a container, and then to release it:

        // ugi = UserGroupInformation.getCurrentUser
        // security is not enabled
        def startContainer(packagePath: Path, container: Container, ugi: UserGroupInformation, env: Map[String, String], cmds: String*) {
          info("%s starting container %s %s %s %s %s" format (appAttemptId, packagePath, container, ugi, env, cmds))
          // connect to container manager (based on similar code in the ContainerLauncher in Hadoop MapReduce)
          val contToken = container.getContainerToken
          val address = container.getNodeId.getHost + ":" + container.getNodeId.getPort
          var user = ugi
      
          if (UserGroupInformation.isSecurityEnabled) {
            debug("%s security is enabled" format (appAttemptId))
            val hadoopToken = new Token[ContainerTokenIdentifier](contToken.getIdentifier.array, contToken.getPassword.array, new Text(contToken.getKind), new Text(contToken.getService))
            user = UserGroupInformation.createRemoteUser(address)
            user.addToken(hadoopToken)
            info("%s changed user to %s" format (appAttemptId, user))
          }
      
          val containerManager = user.doAs(new PrivilegedAction[ContainerManager] {
            def run(): ContainerManager = {
              return YarnRPC.create(conf).getProxy(classOf[ContainerManager], NetUtils.createSocketAddr(address), conf).asInstanceOf[ContainerManager]
            }
          })
      
          // set the local package so that the containers and app master are provisioned with it
          val packageResource = Records.newRecord(classOf[LocalResource])
          val packageUrl = ConverterUtils.getYarnUrlFromPath(packagePath)
          val fileStatus = packagePath.getFileSystem(conf).getFileStatus(packagePath)
      
          packageResource.setResource(packageUrl)
          packageResource.setSize(fileStatus.getLen)
          packageResource.setTimestamp(fileStatus.getModificationTime)
          packageResource.setType(LocalResourceType.ARCHIVE)
          packageResource.setVisibility(LocalResourceVisibility.APPLICATION)
      
          // start the container
          val ctx = Records.newRecord(classOf[ContainerLaunchContext])
          ctx.setEnvironment(env)
          ctx.setContainerId(container.getId())
          ctx.setResource(container.getResource())
          ctx.setUser(user.getShortUserName())
          ctx.setCommands(cmds.toList)
          ctx.setLocalResources(Collections.singletonMap("package", packageResource))
      
          debug("%s setting package to %s" format (appAttemptId, packageResource))
          debug("%s setting context to %s" format (appAttemptId, ctx))
      
          val startContainerRequest = Records.newRecord(classOf[StartContainerRequest])
          startContainerRequest.setContainerLaunchContext(ctx)
          containerManager.startContainer(startContainerRequest)
        }
      

        def sendResourceRequest(requests: List[ResourceRequest], release: List[ContainerId]): AMResponse = {
          info("%s sending resource request %s %s" format (appAttemptId, requests, release))
          val req = Records.newRecord(classOf[AllocateRequest])
          req.setResponseId(requestId)
          req.setApplicationAttemptId(appAttemptId)
          req.addAllAsks(requests)
          req.addAllReleases(release)
          requestId += 1
          debug("%s RM resource request %s" format (appAttemptId, req))
          resourceManager.allocate(req).getAMResponse
        }
      

      I have double checked that my ContainerIds are accurate, and they are.

      Any idea what I'm doing wrong here?

      Attachments

        Activity

          People

            Unassigned Unassigned
            criccomini Chris Riccomini
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: