IMPALA-10476 (sub-task of IMPALA-7872, Extended health checks to mark node as down)

Remove executor node with faulty disks from executor group


Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: Distributed Exec

    Description

      If an executor node frequently hits disk I/O failures when reading from or writing to local disk, it should report its unhealthy state to the statestore so that the node can be marked as down and removed from its executor group. This avoids repeated query failures in the cluster and gives an executor node a mechanism to remove itself from scheduling.
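The self-reporting flow described above can be sketched as a toy model. This is only an illustration, not Impala's statestore protocol: the `ExecutorGroup` class, its methods, and the executor names are all hypothetical.

```python
class ExecutorGroup:
    """Toy model of executor group membership (all names are hypothetical)."""

    def __init__(self, name):
        self.name = name
        self._healthy = {}  # executor id -> latest self-reported health (bool)

    def report(self, executor_id, healthy):
        # Each executor publishes its own health state; nothing probes it.
        self._healthy[executor_id] = healthy

    def schedulable(self):
        # Only executors whose latest report was healthy are eligible.
        return sorted(e for e, ok in self._healthy.items() if ok)


group = ExecutorGroup("default")
group.report("executor-a", True)
group.report("executor-b", True)
group.report("executor-b", False)  # executor-b reports faulty disks
print(group.schedulable())
```

In this sketch an unhealthy executor stays known to the group but is skipped for scheduling, so a later healthy report would make it eligible again.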

      The two major components of Impala that read from and write to local disk are the spill-to-disk and data caching features. We need to add stats that count such local disk failures over a recent time window (for example, the last x seconds), then use these stats to decide whether a node is healthy enough to execute query fragment instances.
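One way to count failures over the last x seconds is a sliding window of timestamps. The sketch below is an assumption about how such a stat might look, not the design chosen for Impala: the class name, the `window_s` and `max_failures` parameters, and their default values are hypothetical.

```python
import time
from collections import deque


class DiskFailureWindow:
    """Counts local disk I/O failures seen during the last `window_s` seconds."""

    def __init__(self, window_s=60.0, max_failures=5, clock=time.monotonic):
        self.window_s = window_s          # hypothetical window length
        self.max_failures = max_failures  # hypothetical tolerated failure count
        self._clock = clock               # injectable so tests can fake time
        self._events = deque()            # failure timestamps, oldest first

    def _evict_expired(self, now):
        # Drop timestamps that have fallen out of the window.
        while self._events and now - self._events[0] > self.window_s:
            self._events.popleft()

    def record_failure(self):
        # Called by a disk reader/writer (e.g. spilling or caching) on an I/O error.
        now = self._clock()
        self._evict_expired(now)
        self._events.append(now)

    def failure_count(self):
        self._evict_expired(self._clock())
        return len(self._events)

    def is_healthy(self):
        # Healthy while recent failures do not exceed the tolerated count.
        return self.failure_count() <= self.max_failures
```

Because old timestamps are evicted on every call, a node that stops failing becomes healthy again once the window has passed, without any explicit reset.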

      The health state of an executor node should be shown on the debug WebUI. We should also allow users to override the node's health state. Once its health state is overridden, the node will register itself with the statestore again.
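The override behaviour could be modelled as a user-supplied value that takes precedence over the measured one. This is a minimal sketch under that assumption; the `Health` enum and both function names are hypothetical and do not correspond to an existing Impala API.

```python
from enum import Enum
from typing import Optional


class Health(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"


def effective_health(measured: Health, override: Optional[Health] = None) -> Health:
    # A user-supplied override, when present, wins over the measured state.
    return measured if override is None else override


def should_register(measured: Health, override: Optional[Health] = None) -> bool:
    # The node offers itself for scheduling only when its effective state is healthy.
    return effective_health(measured, override) is Health.HEALTHY


# A node measured as unhealthy registers again once a user overrides its state.
print(should_register(Health.UNHEALTHY))
print(should_register(Health.UNHEALTHY, override=Health.HEALTHY))
```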


          People

            wzhou Wenzhe Zhou
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated: