IMPALA-7305

membership entry for failed impalad gets stuck in statestore due to race between failure detection and update processing

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.11.0
    • Fix Version/s: Impala 2.12.0, Impala 3.1.0
    • Component/s: Distributed Exec
    • Labels: None

    Description

      I was able to reproduce this bug on a version of Impala that predates IMPALA-4953, using the attached patch that adds a sleep. The patch is a hack and only works on my system (it has a name hardcoded). The trick is to kill the third impalad manually while the cluster is starting up.

      The system then gets stuck in a state where all impalads think 22002 is alive even though the process was actually killed. Queries fail because they keep getting scheduled on the dead impalad.

      Known backend(s): 3
      Address               Coordinator  Executor
      tarmstrong-box:22002  true         true
      tarmstrong-box:22001  true         true
      tarmstrong-box:22000  true         true
      

      The race seems quite exotic but may be possible if there are intermittent transport errors (causing heartbeats to fail) or if there are delays processing topics, e.g. contending for locks.

      IMPALA-4953 fixes the problem by deleting newly-added transient entries if the subscriber got unregistered while the statestore was processing an update.
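
The interleaving described above can be illustrated with a small, deterministic toy model. This is only a sketch under stated assumptions and is not Impala's statestore code: the class `ToyStatestore`, the helper `run_race`, the `interleave` hook, and the `host:PORT` topic keys are all hypothetical names invented for illustration. The hook stands in for the failure detector unregistering a subscriber while that subscriber's update is still being processed.

```python
class ToyStatestore:
    """Toy model of a membership topic (hypothetical; not Impala's code)."""

    def __init__(self, apply_fix):
        self.apply_fix = apply_fix
        self.registered = set()   # subscriber ids currently registered
        self.membership = {}      # transient topic key -> owning subscriber id

    def register(self, sub_id):
        self.registered.add(sub_id)

    def unregister(self, sub_id):
        # Failure detection: drop the subscriber and the transient entries
        # it owns at this moment.
        self.registered.discard(sub_id)
        for key in [k for k, owner in self.membership.items() if owner == sub_id]:
            del self.membership[key]

    def process_update(self, sub_id, keys, interleave=None):
        if sub_id not in self.registered:
            return []
        if interleave is not None:
            # Simulates the failure detector running concurrently, after the
            # registration check but before the entries are added.
            interleave()
        added = []
        for key in keys:
            if key not in self.membership:
                added.append(key)
            self.membership[key] = sub_id
        # Sketch of the fix idea: if the subscriber was unregistered while
        # the update was being processed, delete the entries just added.
        if self.apply_fix and sub_id not in self.registered:
            for key in added:
                del self.membership[key]
            added = []
        return added


def run_race(apply_fix):
    """Replays the race for the third impalad and returns the membership keys."""
    store = ToyStatestore(apply_fix)
    for port in (22000, 22001, 22002):
        store.register(f"impalad@{port}")
    store.process_update("impalad@22000", ["host:22000"])
    store.process_update("impalad@22001", ["host:22001"])
    # The third impalad is killed while its update is in flight.
    store.process_update(
        "impalad@22002",
        ["host:22002"],
        interleave=lambda: store.unregister("impalad@22002"),
    )
    return sorted(store.membership)


if __name__ == "__main__":
    print("without fix:", run_race(apply_fix=False))
    print("with fix:   ", run_race(apply_fix=True))
```

Without the cleanup step, the entry for the dead subscriber is added after the unregistration has already swept its (then empty) set of entries, so nothing ever removes it. With the cleanup step, the toy model ends up with only the two live entries.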

          People

            Assignee: Tim Armstrong (tarmstrong)
            Reporter: Tim Armstrong (tarmstrong)