Kudu / KUDU-3082

Tablets remain in "CONSENSUS_MISMATCH" state for a long time

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.10.1
    • Fix Version/s: None
    • Component/s: consensus
    • Labels: None

    Description

      Recently we found that a few tablets in one of our clusters were unhealthy. The ksck output looks like this:

       

      Tablet Summary
      Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 replicas' active configs disagree with the leader master's
        7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
        d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
        47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
      All reported replicas are:
        A = 7380d797d2ea49e88d71091802fb1c81
        B = d1952499f94a4e6087bee28466fcb09f
        C = 47af52df1adc47e1903eb097e9c88f2e
        D = 08beca5ed4d04003b6979bf8bac378d2
      The consensus matrix is:
       Config source |     Replicas     | Current term | Config index | Committed?
      ---------------+------------------+--------------+--------------+------------
       master        | A   B   C*       |              |              | Yes
       A             | A   B   C*       | 5            | -1           | Yes
       B             | A   B   C        | 5            | -1           | Yes
       C             | A   B   C*  D~   | 5            | 54649        | No
      Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 replicas' active configs disagree with the leader master's
        d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
        47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
        5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
      All reported replicas are:
        A = d1952499f94a4e6087bee28466fcb09f
        B = 47af52df1adc47e1903eb097e9c88f2e
        C = 5a8aeadabdd140c29a09dabcae919b31
        D = 14632cdbb0d04279bc772f64e06389f9
      The consensus matrix is:
       Config source |     Replicas     | Current term | Config index | Committed?
      ---------------+------------------+--------------+--------------+------------
       master        | A   B*  C        |              |              | Yes
       A             | A   B*  C        | 5            | 5            | Yes
       B             | A   B*  C   D~   | 5            | 96176        | No
       C             | A   B*  C        | 5            | 5            | Yes
      Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 replicas' active configs disagree with the leader master's
        a9eaff3cf1ed483aae849549999d649a (kudu-ts23): RUNNING
        f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
        47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
      All reported replicas are:
        A = a9eaff3cf1ed483aae849549999d649a
        B = f75df4a6b5ce404884313af5f906b392
        C = 47af52df1adc47e1903eb097e9c88f2e
        D = d1952499f94a4e6087bee28466fcb09f
      The consensus matrix is:
       Config source |     Replicas     | Current term | Config index | Committed?
      ---------------+------------------+--------------+--------------+------------
       master        | A   B   C*       |              |              | Yes
       A             | A   B   C*       | 1            | -1           | Yes
       B             | A   B   C*       | 1            | -1           | Yes
       C             | A   B   C*  D~   | 1            | 2            | No
      Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 replicas' active configs disagree with the leader master's
        47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
        f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
        f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
      All reported replicas are:
        A = 47af52df1adc47e1903eb097e9c88f2e
        B = f0f7b2f4b9d344e6929105f48365f38e
        C = f75df4a6b5ce404884313af5f906b392
        D = d1952499f94a4e6087bee28466fcb09f
      The consensus matrix is:
       Config source |     Replicas     | Current term | Config index | Committed?
      ---------------+------------------+--------------+--------------+------------
       master        | A*  B   C        |              |              | Yes
       A             | A*  B   C   D~   | 1            | 1991         | No
       B             | A*  B   C        | 1            | 4            | Yes
       C             | A*  B   C        | 1            | 4            | Yes

      These tablets did not recover for a couple of days, until we restarted kudu-ts27.
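
As a reading aid, the consensus matrices above can be checked mechanically. The following is a minimal Python sketch, not part of Kudu, that parses one matrix as printed by ksck and lists the config sources whose active config is not committed, together with the replicas that are missing from the master's config. The function names (`parse_matrix`, `uncommitted_extras`) are hypothetical, and reading `*` as the leader marker and `~` as the non-voter marker is an assumption based on the output shown here.

```python
import re

# One data row of the matrix, e.g.
# " C             | A   B   C*  D~   | 5            | 54649        | No"
# The header and separator rows do not match and are skipped.
ROW_RE = re.compile(
    r"^\s*(?P<source>\S+)\s*\|(?P<replicas>[^|]*)\|(?P<term>[^|]*)"
    r"\|(?P<index>[^|]*)\|\s*(?P<committed>Yes|No)\s*$"
)


def _to_int(cell):
    """Return the integer in a table cell, or None if the cell is blank."""
    cell = cell.strip()
    return int(cell) if cell else None


def parse_matrix(text):
    """Parse the rows of one ksck consensus matrix into dictionaries."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m is None:
            continue
        replicas = {}
        for token in m["replicas"].split():
            label = token.rstrip("*~")
            # Assumed meaning of the suffix: "*" leader, "~" non-voter.
            replicas[label] = token[len(label):]
        rows.append({
            "source": m["source"],
            "replicas": replicas,
            "term": _to_int(m["term"]),
            "index": _to_int(m["index"]),
            "committed": m["committed"] == "Yes",
        })
    return rows


def uncommitted_extras(rows):
    """List (source, extra replicas) for every uncommitted replica config."""
    master = next((r for r in rows if r["source"] == "master"), None)
    known = set(master["replicas"]) if master else set()
    result = []
    for row in rows:
        if row["source"] == "master" or row["committed"]:
            continue
        result.append((row["source"], sorted(set(row["replicas"]) - known)))
    return result


# Matrix of tablet 7404240f458f462d92b6588d07583a52, copied from the report.
SAMPLE = """\
 Config source |     Replicas     | Current term | Config index | Committed?
---------------+------------------+--------------+--------------+------------
 master        | A   B   C*       |              |              | Yes
 A             | A   B   C*       | 5            | -1           | Yes
 B             | A   B   C        | 5            | -1           | Yes
 C             | A   B   C*  D~   | 5            | 54649        | No
"""

print(uncommitted_extras(parse_matrix(SAMPLE)))  # [('C', ['D'])]
```

For the first tablet this reports that only replica C, the leader on kudu-ts27, holds an uncommitted config, and that this config adds replica D. The other three matrices follow the same pattern: the replica on kudu-ts27 is the leader and is the only one with an uncommitted config that adds a replica D marked `~`.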

      I found many repeated log messages on kudu-ts27 like the following:

      I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e [term 3 LEADER]: attempt to promote peer 08beca5ed4d04003b6979bf8bac378d2: there is already a config change operation in progress. Unable to promote follower until it completes. Doing nothing.
      I0314 04:38:41.751009 65453 raft_consensus.cc:937] T 6d9d3fb034314fa7bee9cfbf602bcdc8 P 47af52df1adc47e1903eb097e9c88f2e [term 5 LEADER]: attempt to promote peer 14632cdbb0d04279bc772f64e06389f9: there is already a config change operation in progress. Unable to promote follower until it completes. Doing nothing.
      
      

      It seems that some RaftConfig change operations somehow cannot complete.
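
To gauge how often the leader retried, the repeated messages quoted above can be tallied per tablet and peer. The following is a minimal Python sketch under the assumption that every relevant line has exactly the shape of the two lines in this report. The regular expression and the function name `count_blocked_promotions` are illustrative, not part of Kudu.

```python
import re
from collections import Counter

# Matches the "attempt to promote peer" message as it appears in the report.
PROMOTE_RE = re.compile(
    r"T (?P<tablet>[0-9a-f]{32}) P (?P<leader>[0-9a-f]{32}) "
    r"\[term (?P<term>\d+) LEADER\]: "
    r"attempt to promote peer (?P<peer>[0-9a-f]{32}): "
    r"there is already a config change operation in progress"
)


def count_blocked_promotions(lines):
    """Count blocked promotion attempts, keyed by (tablet id, peer uuid)."""
    counts = Counter()
    for line in lines:
        m = PROMOTE_RE.search(line)
        if m is not None:
            counts[(m["tablet"], m["peer"])] += 1
    return counts


# The two log lines quoted in the report.
SAMPLE_LOG = [
    "I0314 04:38:41.511279 65731 raft_consensus.cc:937] "
    "T 7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e "
    "[term 3 LEADER]: attempt to promote peer 08beca5ed4d04003b6979bf8bac378d2: "
    "there is already a config change operation in progress. "
    "Unable to promote follower until it completes. Doing nothing.",
    "I0314 04:38:41.751009 65453 raft_consensus.cc:937] "
    "T 6d9d3fb034314fa7bee9cfbf602bcdc8 P 47af52df1adc47e1903eb097e9c88f2e "
    "[term 5 LEADER]: attempt to promote peer 14632cdbb0d04279bc772f64e06389f9: "
    "there is already a config change operation in progress. "
    "Unable to promote follower until it completes. Doing nothing.",
]

for (tablet, peer), n in count_blocked_promotions(SAMPLE_LOG).items():
    print(f"tablet {tablet}: {n} blocked attempt(s) to promote {peer}")
```

On a real log file, the same function can be fed the open file object directly, since it only iterates over lines. The peers it reports for these two tablets (08beca5e… and 14632cdb…) are the replicas labelled D in the corresponding ksck matrices.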

       

      Attachments

        1. ts26.log.gz
          1.08 MB
          YifanZhang
        2. ts25.info.gz
          1.52 MB
          YifanZhang
        3. master_leader.log
          7 kB
          YifanZhang

          People

            Assignee: Unassigned
            Reporter: YifanZhang (zhangyifan27)
            Votes: 0
            Watchers: 4
