Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-9425

Statestore may fail to report when an impalad has failed

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 3.4.0
    • Fix Version/s: Impala 3.4.0
    • Component/s: Distributed Exec
    • Labels:
      None
    • Epic Color:
      ghx-label-10

      Description

      If an impalad fails and another is restarted at the same host:port combination quickly, the statestore may fail to report to the coordinators that the impalad went down.

      The reason for this is that in the cluster membership topic, impalads are keyed by their statestore subscriber id, which is "impalad@host:port". If the new impalad registers itself before a topic update has been generated for a particular coordinator, the statestore has no way of knowing that the particular key was deleted and then re-added since the last update.

      The result is that queries that were running on the impalad that failed may not be cancelled by the coordinator until they pass the unresponsive backend timeout, which by default is ~12 minutes.

      I propose as a solution that we add a concept of uuids for impalads, where each impalad will generate its own uuid on startup. This allows us to differentiate between different impalads running at the same host:port combination.

      It can also be used to simplify some logic in the scheduler and ExecutorGroup/ExecutorBlacklist etc. where we currently have data structures containing info about impalads that are keyed off host/port combinations.

        Attachments

          Activity

            People

            • Assignee:
              twmarshall Thomas Tauber-Marshall
              Reporter:
              twmarshall Thomas Tauber-Marshall
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: