IMPALA-9295: RPC failures don't always trigger a blacklist


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Impala 3.4.0
    • Component/s: Backend
    • Labels:
      None

      Description

      There is a race condition in IMPALA-9137. It is possible for the aux_error_info and the failure status to arrive in separate exec status reports.

      IMPALA-9137 added AuxErrorInfoPB to FragmentInstanceExecStatusPB (contains per-fragment info for a ReportExecStatusRequestPB). The idea is that if a query fails, the Coordinator would use the AuxErrorInfoPB to potentially blacklist any nodes that caused the failure. The Coordinator only looks for AuxErrorInfoPB if the query has failed (e.g. if ReportExecStatusRequestPB::overall_status is set to an error).

      The issue is that it is possible for the AuxErrorInfoPB to be set even though overall_status == OK. There is a race condition on the executor side where the setting of aux_error_info and overall_status is not synchronized. So if a fragment fails due to an RPC error, it is possible for report number "x" to include the aux_error_info with overall_status == OK, and for report number "x + 1" to include no aux_error_info with overall_status == [some-RPC-failure-message].

      Report number "x + 1" won't include the aux_error_info since the fragment has finished, and its last FragmentInstanceExecStatusPB was included in report number "x".


              People

              • Assignee:
                stakiar Sahil Takiar
                Reporter:
                stakiar Sahil Takiar
              • Votes:
                0
                Watchers:
                3
