IMPALA-7305

membership entry for failed impalad gets stuck in statestore due to race between failure detection and update processing


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.11.0
    • Fix Version/s: Impala 2.12.0, Impala 3.1.0
    • Component/s: Distributed Exec
    • Labels:
      None

      Description

      I was able to reproduce this bug on a build of Impala that predates IMPALA-4953, using the attached patch, which adds a sleep. The patch is a hack and only works on my system (it has a name hardcoded). The trick is to kill the third impalad manually while the cluster is starting up.

      The system then gets stuck in a state where all impalads think the impalad on port 22002 is alive even though the process was killed. Queries fail because they keep getting scheduled on the dead impalad.

      Known backend(s): 3
      Address                Coordinator  Executor
      tarmstrong-box:22002   true         true
      tarmstrong-box:22001   true         true
      tarmstrong-box:22000   true         true
      

      The race seems quite exotic but may be possible if there are intermittent transport errors (causing heartbeats to fail) or if there are delays processing topics, e.g. contending for locks.

      IMPALA-4953 fixes the problem by deleting newly-added transient entries if the subscriber got unregistered while the statestore was processing an update.
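
The interleaving described above can be sketched as a small deterministic toy model. This is not the statestore's actual code: `ToyStatestore`, its methods, and the `impalad-22002` subscriber id are hypothetical names made up for illustration, and the race is simulated by calling the two code paths in the problematic order rather than by using real threads. The `cleanup_after_update` flag models the kind of check that IMPALA-4953 is described as adding.

```python
class ToyStatestore:
    """Toy model of membership tracking; all names here are hypothetical."""

    def __init__(self, cleanup_after_update):
        self.registered = set()   # subscriber ids currently considered alive
        self.membership = {}      # transient entry key -> owning subscriber id
        self.cleanup_after_update = cleanup_after_update

    def register(self, sub_id):
        self.registered.add(sub_id)

    def unregister(self, sub_id):
        # Failure-detection path: drop the subscriber and the transient
        # entries it owns at this moment.
        self.registered.discard(sub_id)
        for key in [k for k, owner in self.membership.items() if owner == sub_id]:
            del self.membership[key]

    def apply_update(self, sub_id, entries):
        # Update-processing path: add the entries the subscriber sent.
        added = []
        for key in entries:
            self.membership[key] = sub_id
            added.append(key)
        # Modeled fix: if the subscriber was unregistered while the update
        # was being processed, delete the entries that were just added.
        if self.cleanup_after_update and sub_id not in self.registered:
            for key in added:
                self.membership.pop(key, None)


def run_race(cleanup_after_update):
    store = ToyStatestore(cleanup_after_update)
    store.register("impalad-22002")
    # The subscriber's update has been received but not yet applied
    # (the sleep in the repro patch would widen this window).
    pending = ["tarmstrong-box:22002"]
    # The impalad is killed and failure detection unregisters it first.
    store.unregister("impalad-22002")
    # Update processing then resumes and applies the stale update.
    store.apply_update("impalad-22002", pending)
    return sorted(store.membership)


if __name__ == "__main__":
    print("without cleanup:", run_race(False))
    print("with cleanup:   ", run_race(True))
```

Without the cleanup step the entry for the dead impalad is added after its owner has already been unregistered, so nothing ever removes it. With the cleanup step the stale entry is deleted as soon as the update finishes.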

        Attachments

        1. 0001-Repro-CDH-70703.patch
          3 kB
          Tim Armstrong


              People

              • Assignee: Tim Armstrong (tarmstrong)
              • Reporter: Tim Armstrong (tarmstrong)
