Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4409

ImpalaServer::CancelInternal() can deadlock

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.7.0, Impala 2.8.0
    • Fix Version/s: Impala 2.8.0
    • Component/s: Distributed Exec
    • Labels:

      Description

      While looking through the code to understand another potential bug, I found out that CancelInternal() violates the lock ordering in ImpalaServer by acquiring SessionState::lock_ while holding QueryExecState::lock_. This could cause a hang if a number of threads were holding the right locks at the right time. I traced out the lock dependencies using a directed graph, which I've attached.

      1. impala-server-lock-order.dot
        1.0 kB
        Tim Armstrong
      2. impala-server-lock-order.pdf
        19 kB
        Tim Armstrong

        Activity

        Hide
        HuaisiXu Huaisi Xu added a comment -

        is this caused by fixes from IMPALA-4037,IMPALA-4038? Could you also elaborate under which events this can happen?

        Show
        HuaisiXu Huaisi Xu added a comment - is this caused by fixes from IMPALA-4037 , IMPALA-4038 ? Could you also elaborate under which events this can happen?
        Hide
        tarmstrong Tim Armstrong added a comment -

        No it predates those changes. To understand when it can happen, look at the cycle in the lock order graph.

        Show
        tarmstrong Tim Armstrong added a comment - No it predates those changes. To understand when it can happen, look at the cycle in the lock order graph.
        Hide
        jyu@cloudera.com Juan Yu added a comment -

        commit 9755ea63b50d23f323c84edb4307ce603ded4fe1
        Author: Tim Armstrong <tarmstrong@cloudera.com>
        Date: Mon Oct 31 16:40:53 2016 -0700

        IMPALA-4409: respect lock order in QueryExecState::CancelInternal()

        The code previously violated the (partially documented) lock order
        in ImpalaServer. An example of a possible cycle in the dependency
        graph is:

        • SetQueryInFlight() holds SessionState::lock_ and waits for
          'query_expiration_lock_'
        • ExpireQueries() holds 'query_expiration_lock_' and waits for
          'query_exec_state_map_lock_'
        • GetQueryExecState() holds 'query_exec_state_map_lock_' and
          waits for QueryExecState::lock_
        • QES::Cancel() holds QueryExecState::lock_
          and waits for SessionState::lock

        It's not clear how likely the above scenario is, but it's hard to rule
        it out.

        We have not seen this hang in the wild but have seen similar ones.

        Testing:
        Ran local stress test on 3-node minicluster with TPC-H 20 and 50%
        of queries being cancelled.

        Change-Id: I785fea0163a90d0633fb6ed77ec7c6882ab5c110
        Reviewed-on: http://gerrit.cloudera.org:8080/4896
        Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
        Tested-by: Internal Jenkins

        Show
        jyu@cloudera.com Juan Yu added a comment - commit 9755ea63b50d23f323c84edb4307ce603ded4fe1 Author: Tim Armstrong <tarmstrong@cloudera.com> Date: Mon Oct 31 16:40:53 2016 -0700 IMPALA-4409 : respect lock order in QueryExecState::CancelInternal() The code previously violated the (partially documented) lock order in ImpalaServer. An example of a possible cycle in the dependency graph is: SetQueryInFlight() holds SessionState::lock_ and waits for 'query_expiration_lock_' ExpireQueries() holds 'query_expiration_lock_' and waits for 'query_exec_state_map_lock_' GetQueryExecState() holds 'query_exec_state_map_lock_' and waits for QueryExecState::lock_ QES::Cancel() holds QueryExecState::lock_ and waits for SessionState::lock It's not clear how likely the above scenario is, but it's hard to rule it out. We have not seen this hang in the wild but have seen similar ones. Testing: Ran local stress test on 3-node minicluster with TPC-H 20 and 50% of queries being cancelled. Change-Id: I785fea0163a90d0633fb6ed77ec7c6882ab5c110 Reviewed-on: http://gerrit.cloudera.org:8080/4896 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins

          People

          • Assignee:
            tarmstrong Tim Armstrong
            Reporter:
            tarmstrong Tim Armstrong
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development