Ambari / AMBARI-15173

Express Upgrade Stuck At Manual Prompt Due To HRC Status Calculation Cache Problem


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.2.2
    • Fix Version/s: 2.2.2
    • Component/s: ambari-server
    • Labels: None

    Description

      While performing an upgrade, the status of a request or stage may not match that of its tasks. For example, a task can be HOLDING while its stage and request still report IN_PROGRESS.

      I believe that AMBARI-15011 is responsible for this issue. Among other things, AMBARI-15011 introduced a cache for HostRoleCommandStatusSummaryDTO, which is an aggregation of the number of tasks a stage has in each state (PENDING, HOLDING, etc.).

      This HostRoleCommandStatusSummaryDTO is used by CalculatedState to calculate a stage's and request's status based on the tasks.
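
To make the dependency on the summary concrete, here is a minimal sketch of deriving a stage status from per-state task counts. It is not Ambari's actual CalculatedState logic: the class name, the reduced status enum, the precedence of the rules, and the handling of an empty stage are all assumptions made for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

public class StatusFromCountsSketch {
    // Reduced, hypothetical set of task states; the real enum has more values.
    enum TaskStatus { PENDING, IN_PROGRESS, HOLDING, COMPLETED, FAILED }

    // Derive a stage status from "how many tasks are in each state".
    // The rule order below is an assumption, not Ambari's real precedence.
    static TaskStatus stageStatus(Map<TaskStatus, Integer> counts) {
        int total = 0;
        for (int n : counts.values()) {
            total += n;
        }
        if (total == 0) {
            return TaskStatus.COMPLETED; // assumption: an empty stage counts as done
        }
        if (counts.getOrDefault(TaskStatus.HOLDING, 0) > 0) {
            return TaskStatus.HOLDING;   // any task waiting on the user holds the stage
        }
        if (counts.getOrDefault(TaskStatus.FAILED, 0) > 0) {
            return TaskStatus.FAILED;
        }
        if (counts.getOrDefault(TaskStatus.COMPLETED, 0) == total) {
            return TaskStatus.COMPLETED;
        }
        if (counts.getOrDefault(TaskStatus.PENDING, 0) == total) {
            return TaskStatus.PENDING;
        }
        return TaskStatus.IN_PROGRESS;
    }

    public static void main(String[] args) {
        // The same single task, counted under two different states.
        Map<TaskStatus, Integer> asInProgress = new EnumMap<>(TaskStatus.class);
        asInProgress.put(TaskStatus.IN_PROGRESS, 1);

        Map<TaskStatus, Integer> asHolding = new EnumMap<>(TaskStatus.class);
        asHolding.put(TaskStatus.HOLDING, 1);

        System.out.println(stageStatus(asInProgress));
        System.out.println(stageStatus(asHolding));
    }
}
```

Because the result depends only on the counts, whatever the summary reports is what the stage and request will report, regardless of what the task rows themselves say.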

      The problem is that ServerActionExecutor moves a task's state to HOLDING (correctly reflected in the database once committed), but the cache invalidation happens inside the still-uncommitted transaction. Any status read that arrives before the commit reloads the old task counts, so stale data is re-cached. As a result, when we calculate the request and stage status, we get IN_PROGRESS instead of HOLDING.

      {
        "href": "http://172.22.72.13:8080/api/v1/clusters/cl1/requests/61/stages/1?fields=*,tasks/*",
        "Stage": {
          "cluster_name": "cl1",
          "context": "Stop YARN Queues",
          "display_status": "IN_PROGRESS",
          "end_time": -1,
          "progress_percent": 35,
          "request_id": 61,
          "skippable": true,
          "stage_id": 1,
          "start_time": 1456227329191,
          "status": "IN_PROGRESS"
        },
        "tasks": [
          {
            "href": "http://172.22.72.13:8080/api/v1/clusters/cl1/requests/61/stages/1/tasks/754",
            "Tasks": {
              "attempt_cnt": 1,
              "cluster_name": "cl1",
              "command": "EXECUTE",
              "command_detail": "Before continuing, please stop all YARN queues. If yarn-site's yarn.resourcemanager.work-preserving-recovery.enabled is set to true, then you can skip this step since the clients will retry on their own.",
              "custom_command_name": "org.apache.ambari.server.serveraction.upgrades.ManualStageAction",
              "end_time": -1,
              "error_log": "errors-754.txt",
              "exit_code": 0,
              "host_name": "os-r6-mkqzcs-c10tom21unsecha-6.novalocal",
              "id": 754,
              "output_log": "output-754.txt",
              "request_id": 61,
              "role": "AMBARI_SERVER_ACTION",
              "stage_id": 1,
              "start_time": 1456227329191,
              "status": "HOLDING",
              "stderr": "",
              "stdout": "",
              "structured_out": {}
            }
          }
        ]
      }
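
The ordering problem described above can be reproduced in a single-threaded sketch. Everything here is hypothetical: FakeDb and StatusCache are toy stand-ins rather than Ambari classes, and the interleaved read that would come from another thread is simulated by an ordinary call. Invalidating after the commit is shown only as one possible remedy; this ticket does not state what the actual fix was.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleCacheSketch {
    // Toy stand-in for the task table: writes stay pending until commit().
    static class FakeDb {
        private final Map<Long, String> committed = new HashMap<>();
        private final Map<Long, String> pending = new HashMap<>();

        void write(long id, String status) { pending.put(id, status); }
        void commit() { committed.putAll(pending); pending.clear(); }
        // Readers outside the transaction only ever see committed data.
        String readCommitted(long id) { return committed.get(id); }
    }

    // Toy read-through cache keyed by task id.
    static class StatusCache {
        private final Map<Long, String> entries = new HashMap<>();
        private final FakeDb db;

        StatusCache(FakeDb db) { this.db = db; }
        String get(long id) { return entries.computeIfAbsent(id, db::readCommitted); }
        void invalidate(long id) { entries.remove(id); }
    }

    private static FakeDb seededDb() {
        FakeDb db = new FakeDb();
        db.write(754L, "IN_PROGRESS");
        db.commit();
        return db;
    }

    // Buggy ordering: invalidate while the HOLDING write is still uncommitted.
    static String invalidateBeforeCommit() {
        FakeDb db = seededDb();
        StatusCache cache = new StatusCache(db);
        cache.get(754L);             // warm the cache with IN_PROGRESS
        db.write(754L, "HOLDING");   // transaction writes, not yet committed
        cache.invalidate(754L);      // invalidation inside the transaction
        cache.get(754L);             // interleaved read re-caches the old value
        db.commit();                 // database now says HOLDING
        return cache.get(754L);      // cache still answers with the stale entry
    }

    // Possible remedy: invalidate only once the write is visible to readers.
    static String invalidateAfterCommit() {
        FakeDb db = seededDb();
        StatusCache cache = new StatusCache(db);
        cache.get(754L);
        db.write(754L, "HOLDING");
        cache.get(754L);             // interleaved read; harmless, entry is dropped below
        db.commit();
        cache.invalidate(754L);
        return cache.get(754L);
    }

    public static void main(String[] args) {
        System.out.println(invalidateBeforeCommit()); // prints IN_PROGRESS
        System.out.println(invalidateAfterCommit());  // prints HOLDING
    }
}
```

In the first ordering the database and the cache disagree after the commit, which matches the response above: the task row reads HOLDING while the cached summary still produces IN_PROGRESS for the stage.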
      

            People

              jonathanhurley Jonathan Hurley