Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4037

ChildQuery::Cancel() appears to violate lock ordering

    Details

      Description

      The lock order is ImpalaServer::query_exec_state_map_lock_ -> QueryExecState::lock_, but there's a possible callstack that violates this.

      • ImpalaServer::CancelInternal() holding QueryExecState::lock_
      • calls QueryExecState::Cancel()
      • calls ChildQuery::Cancel()
      • calls ImpalaServer::CancelOperation()
      • calls ImpalaServer::GetQueryExecState(), which grabs ImpalaServer::query_exec_state_map_lock_

      This could lead to deadlock if another thread called GetQueryExecState() on the same ImpalaServer instance (i.e. HS2/beewax) for the same QueryExecState at the same time.

        Issue Links

          Activity

          Hide
          tarmstrong Tim Armstrong added a comment -

          IMPALA-4037,IMPALA-4038: fix locking during query cancellation

          • Refactor the child query handling out of QueryExecState and clarify
            locking rules.
          • Avoid holding QueryExecState::lock_ while calling
            Coordinator::Cancel() or ChildQuery::Cancel(), which can both do RPCs
            or acquire ImpalaServer::query_exec_state_map_lock_.
          • Fix a potential race between QueryExecState::Exec() and
            QueryExecState::Cancel() where the cancelling thread did an unlocked
            read of the 'coordinator_' field and may not have cancelled the
            coordinator.

          Testing:
          Ran exhaustive build, ran local stress test for a bit.

          Show
          tarmstrong Tim Armstrong added a comment - IMPALA-4037 , IMPALA-4038 : fix locking during query cancellation Refactor the child query handling out of QueryExecState and clarify locking rules. Avoid holding QueryExecState::lock_ while calling Coordinator::Cancel() or ChildQuery::Cancel(), which can both do RPCs or acquire ImpalaServer::query_exec_state_map_lock_. Fix a potential race between QueryExecState::Exec() and QueryExecState::Cancel() where the cancelling thread did an unlocked read of the 'coordinator_' field and may not have cancelled the coordinator. Testing: Ran exhaustive build, ran local stress test for a bit.
          Hide
          laszlog Laszlo Gaal added a comment -

          Code review can be found here: https://gerrit.cloudera.org/#/c/4163/

          Show
          laszlog Laszlo Gaal added a comment - Code review can be found here: https://gerrit.cloudera.org/#/c/4163/

            People

            • Assignee:
              tarmstrong Tim Armstrong
              Reporter:
              tarmstrong Tim Armstrong
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development