  1. Kudu
  2. KUDU-3082

tablets in "CONSENSUS_MISMATCH" state for a long time

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.10.1
    • Fix Version/s: None
    • Component/s: consensus
    • Labels: None

    Description

      Recently we found that a few tablets in one of our clusters were unhealthy. The ksck output looks like this:

       

      Tablet Summary
      Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 replicas' active configs disagree with the leader master's
        7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
        d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
        47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
      All reported replicas are:
        A = 7380d797d2ea49e88d71091802fb1c81
        B = d1952499f94a4e6087bee28466fcb09f
        C = 47af52df1adc47e1903eb097e9c88f2e
        D = 08beca5ed4d04003b6979bf8bac378d2
      The consensus matrix is:
       Config source |     Replicas     | Current term | Config index | Committed?
      ---------------+------------------+--------------+--------------+------------
       master        | A   B   C*       |              |              | Yes
       A             | A   B   C*       | 5            | -1           | Yes
       B             | A   B   C        | 5            | -1           | Yes
       C             | A   B   C*  D~   | 5            | 54649        | No
      Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 replicas' active configs disagree with the leader master's
        d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
        47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
        5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
      All reported replicas are:
        A = d1952499f94a4e6087bee28466fcb09f
        B = 47af52df1adc47e1903eb097e9c88f2e
        C = 5a8aeadabdd140c29a09dabcae919b31
        D = 14632cdbb0d04279bc772f64e06389f9
      The consensus matrix is:
       Config source |     Replicas     | Current term | Config index | Committed?
      ---------------+------------------+--------------+--------------+------------
       master        | A   B*  C        |              |              | Yes
       A             | A   B*  C        | 5            | 5            | Yes
       B             | A   B*  C   D~   | 5            | 96176        | No
       C             | A   B*  C        | 5            | 5            | Yes
      Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 replicas' active configs disagree with the leader master's
        a9eaff3cf1ed483aae849549999d649a (kudu-ts23): RUNNING
        f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
        47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
      All reported replicas are:
        A = a9eaff3cf1ed483aae849549999d649a
        B = f75df4a6b5ce404884313af5f906b392
        C = 47af52df1adc47e1903eb097e9c88f2e
        D = d1952499f94a4e6087bee28466fcb09f
      The consensus matrix is:
       Config source |     Replicas     | Current term | Config index | Committed?
      ---------------+------------------+--------------+--------------+------------
       master        | A   B   C*       |              |              | Yes
       A             | A   B   C*       | 1            | -1           | Yes
       B             | A   B   C*       | 1            | -1           | Yes
       C             | A   B   C*  D~   | 1            | 2            | No
      Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 replicas' active configs disagree with the leader master's
        47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
        f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
        f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
      All reported replicas are:
        A = 47af52df1adc47e1903eb097e9c88f2e
        B = f0f7b2f4b9d344e6929105f48365f38e
        C = f75df4a6b5ce404884313af5f906b392
        D = d1952499f94a4e6087bee28466fcb09f
      The consensus matrix is:
       Config source |     Replicas     | Current term | Config index | Committed?
      ---------------+------------------+--------------+--------------+------------
       master        | A*  B   C        |              |              | Yes
       A             | A*  B   C   D~   | 1            | 1991         | No
       B             | A*  B   C        | 1            | 4            | Yes
       C             | A*  B   C        | 1            | 4            | Yes
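
      For anyone triaging similar output, here is a rough sketch (not part of Kudu) that picks out the rows of a consensus matrix whose active config is uncommitted. It assumes the five-column layout shown above and, as far as I understand ksck's notation, that `*` marks the leader and `~` marks a non-voter. The names `parse_matrix` and `pending_changes` are hypothetical helpers, and the sample rows are copied from the first tablet.

```python
# Sample rows copied from the ksck output above (tablet 7404240f...).
MATRIX = """\
 Config source |     Replicas     | Current term | Config index | Committed?
---------------+------------------+--------------+--------------+------------
 master        | A   B   C*       |              |              | Yes
 A             | A   B   C*       | 5            | -1           | Yes
 B             | A   B   C        | 5            | -1           | Yes
 C             | A   B   C*  D~   | 5            | 54649        | No
"""


def parse_matrix(text):
    """Parse consensus matrix rows into dicts (assumes the layout shown above)."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Skip the separator line, the header row, and anything malformed.
        if len(cells) != 5 or cells[0] in ("", "Config source"):
            continue
        source, replicas, term, index, committed = cells
        rows.append({
            "source": source,
            "replicas": replicas.split(),
            "term": int(term) if term else None,    # empty for the master row
            "index": int(index) if index else None,
            "committed": committed == "Yes",
        })
    return rows


def pending_changes(rows):
    """Return the rows whose active config has not been committed."""
    return [r for r in rows if not r["committed"]]


for row in pending_changes(parse_matrix(MATRIX)):
    # Assumption: a trailing '~' marks a non-voter replica.
    non_voters = [r.rstrip("~") for r in row["replicas"] if r.endswith("~")]
    print(row["source"], row["index"], non_voters)
```

      Run against each of the four tablets above, this would flag only the row reported by the leader replica on kudu-ts27.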

      These tablets did not recover for a couple of days, until we restarted kudu-ts27.

      The kudu-ts27 log contains many repeated messages like these:

      I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e [term 3 LEADER]: attempt to promote peer 08beca5ed4d04003b6979bf8bac378d2: there is already a config change operation in progress. Unable to promote follower until it completes. Doing nothing.
      I0314 04:38:41.751009 65453 raft_consensus.cc:937] T 6d9d3fb034314fa7bee9cfbf602bcdc8 P 47af52df1adc47e1903eb097e9c88f2e [term 5 LEADER]: attempt to promote peer 14632cdbb0d04279bc772f64e06389f9: there is already a config change operation in progress. Unable to promote follower until it completes. Doing nothing.
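
      To get a feel for how often each promotion is being refused, a minimal sketch like the following could count these messages per tablet and peer. The regular expression is written against the two sample lines above only, and `count_blocked_promotions` is a hypothetical helper name, so treat this as an assumption-laden starting point rather than a general log parser.

```python
import re
from collections import Counter

# Matches the "attempt to promote peer ..." message shown above.
# Assumption: tablet and peer IDs are 32 lowercase hex characters.
PATTERN = re.compile(
    r"T (?P<tablet>[0-9a-f]{32}) P (?P<leader>[0-9a-f]{32}) "
    r"\[term (?P<term>\d+) LEADER\]: "
    r"attempt to promote peer (?P<peer>[0-9a-f]{32}): "
    r"there is already a config change operation in progress"
)


def count_blocked_promotions(lines):
    """Count refused promotions, keyed by (tablet id, peer id)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match["tablet"], match["peer"])] += 1
    return counts


# The two sample lines from the kudu-ts27 log quoted above.
SAMPLE = [
    "I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e [term 3 LEADER]: attempt to promote peer 08beca5ed4d04003b6979bf8bac378d2: there is already a config change operation in progress. Unable to promote follower until it completes. Doing nothing.",
    "I0314 04:38:41.751009 65453 raft_consensus.cc:937] T 6d9d3fb034314fa7bee9cfbf602bcdc8 P 47af52df1adc47e1903eb097e9c88f2e [term 5 LEADER]: attempt to promote peer 14632cdbb0d04279bc772f64e06389f9: there is already a config change operation in progress. Unable to promote follower until it completes. Doing nothing.",
]

for (tablet, peer), n in count_blocked_promotions(SAMPLE).items():
    print(tablet, peer, n)
```

      To scan a real log, pass an open file object instead of `SAMPLE`, since the function accepts any iterable of lines.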
      
      

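      It seems that some Raft config change operations on kudu-ts27, the leader of all four tablets, never complete: in each case the leader's active config includes an extra replica D (marked `~`) and remains uncommitted.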

       

      Attachments

        1. master_leader.log
          7 kB
          YifanZhang
        2. ts25.info.gz
          1.52 MB
          YifanZhang
        3. ts26.log.gz
          1.08 MB
          YifanZhang

          People

            Assignee: Unassigned
            Reporter: YifanZhang (zhangyifan27)
            Votes: 0
            Watchers: 4
