Kudu / KUDU-2410

Add auto-repair function to ksck to repair "stuck tablet" situations common on older versions

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.7.0
    • Fix Version/s: NA
    • Component/s: supportability
    • Labels: None

    Description

      There are two common situations in which tablets get stuck and can't recover automatically, characterized by the following ksck outputs:

      Tombstone + eviction in-flight:

      Tablet 796d3d67d6e0429fb5f91c2c7bbd486d of table 'loadgen_auto_802e774c09d74a208330db4c108a7d30' is under-replicated: 1 replica(s) not RUNNING
        16204380dc404171bebd99af2504cb14 (wdb-k015-2:7050): RUNNING
        61dec96f5aed4cd2a47814de42d721e6 (wdb-k015-3:7050): RUNNING [LEADER]
        d1689e073948415a901c64a9e9269416 (wdb-k015-1:7050): bad state
          State:       NOT_STARTED
          Data state:  TABLET_DATA_TOMBSTONED
          Last status: Tablet initializing...
      
      2 replicas' active configs differ from the master's.
        All the peers reported by the master and tablet servers are:
        A = 16204380dc404171bebd99af2504cb14
        B = 61dec96f5aed4cd2a47814de42d721e6
        C = d1689e073948415a901c64a9e9269416
      
      The consensus matrix is:
       Config source |         Voters         | Current term | Config index | Committed?
      ---------------+------------------------+--------------+--------------+------------
       master        | A   B*  C              |              |              | Yes
       A             | A   B   C              | 2            | 305          | Yes
       B             |     B   C              | 2            | 307          | No
       C             | [config not available] |              |              |
      

      Permanently failed + eviction in-flight:

      Tablet 796d3d67d6e0429fb5f91c2c7bbd486d of table 'loadgen_auto_802e774c09d74a208330db4c108a7d30' is under-replicated: 1 replica(s) not RUNNING
        16204380dc404171bebd99af2504cb14 (wdb-k015-2:7050): RUNNING
        61dec96f5aed4cd2a47814de42d721e6 (wdb-k015-3:7050): RUNNING [LEADER]
        d1689e073948415a901c64a9e9269416 (wdb-k015-1:7050): missing
      
      2 replicas' active configs differ from the master's.
        All the peers reported by the master and tablet servers are:
        A = 16204380dc404171bebd99af2504cb14
        B = 61dec96f5aed4cd2a47814de42d721e6
        C = d1689e073948415a901c64a9e9269416
      
      The consensus matrix is:
       Config source |         Voters         | Current term | Config index | Committed?
      ---------------+------------------------+--------------+--------------+------------
       master        | A   B*  C              |              |              | Yes
       A             | A   B   C              | 2            | 305          | Yes
       B             |     B   C              | 2            | 307          | No
       C             | [config not available] |              |              |
      

      The former case is resolved by tombstoned voting (KUDU-871), while the latter is made much, much less likely by 3-4-3 replication (KUDU-1097).

      However, tablets still get stuck on older versions, and it shouldn't be too hard to enhance ksck to detect and automatically fix these two situations: by tablet copying B -> C in the first case, and by aborting the config change on B in the second.
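
The detection step could look roughly like the following Python sketch. It is not ksck code and does not use any Kudu API: `parse_consensus_matrix`, `suggest_repair`, `ConfigRow`, and the normalized state strings `RUNNING`, `TOMBSTONED`, and `MISSING` are hypothetical names chosen for illustration. It assumes the consensus matrix has the textual layout shown above and that the per-replica states have already been summarized from the ksck output. The idea is that an uncommitted config which drops a voter cannot commit when a majority of its own members is not running, and the state of the blocking replica selects the repair.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class ConfigRow:
    voters: Optional[Set[str]]   # None when the row says "[config not available]"
    committed: Optional[bool]    # None when the "Committed?" cell is empty


# Consensus matrix copied from the ksck output in this issue.
EXAMPLE_MATRIX = """\
 Config source |         Voters         | Current term | Config index | Committed?
---------------+------------------------+--------------+--------------+------------
 master        | A   B*  C              |              |              | Yes
 A             | A   B   C              | 2            | 305          | Yes
 B             |     B   C              | 2            | 307          | No
 C             | [config not available] |              |              |
"""


def parse_consensus_matrix(text: str) -> Dict[str, ConfigRow]:
    """Parse the matrix rows into {config source: ConfigRow}."""
    rows: Dict[str, ConfigRow] = {}
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Skip the separator line (no '|') and the header row.
        if len(cells) != 5 or cells[0] in ("", "Config source"):
            continue
        if cells[1].startswith("["):
            voters = None
        else:
            # A trailing '*' marks the leader; it is not part of the label.
            voters = {label.rstrip("*") for label in cells[1].split()}
        committed = {"Yes": True, "No": False}.get(cells[4])
        rows[cells[0]] = ConfigRow(voters, committed)
    return rows


def suggest_repair(matrix: Dict[str, ConfigRow],
                   replica_states: Dict[str, str]) -> Optional[str]:
    """Return a suggested repair for a stuck eviction, or None.

    replica_states maps a peer label to a hypothetical normalized state:
    'RUNNING', 'TOMBSTONED', or 'MISSING'.
    """
    master = matrix.get("master")
    if master is None or master.voters is None:
        return None
    for source, row in matrix.items():
        if source == "master" or row.voters is None or row.committed is not False:
            continue
        pending = row.voters
        if not pending < master.voters:
            continue  # the pending config does not drop a voter
        running = {p for p in pending if replica_states.get(p) == "RUNNING"}
        if len(running) > len(pending) // 2:
            continue  # a majority of the pending config can still acknowledge it
        for peer in sorted(pending - running):
            state = replica_states.get(peer)
            if state == "TOMBSTONED":
                return f"tablet copy {source} -> {peer}"
            if state == "MISSING":
                return f"abort pending config change on {source}"
    return None


if __name__ == "__main__":
    matrix = parse_consensus_matrix(EXAMPLE_MATRIX)
    # Tombstone + eviction in-flight
    print(suggest_repair(matrix, {"A": "RUNNING", "B": "RUNNING", "C": "TOMBSTONED"}))
    # prints: tablet copy B -> C
    # Permanently failed + eviction in-flight
    print(suggest_repair(matrix, {"A": "RUNNING", "B": "RUNNING", "C": "MISSING"}))
    # prints: abort pending config change on B
```

The sketch only reports a suggestion as a string. Actually performing the copy or aborting the config change would need the corresponding Kudu tooling, which is outside what this example assumes.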

            People

              Assignee: wdberkeley William Berkeley
              Reporter: wdberkeley William Berkeley
              Votes: 0
              Watchers: 2
