  1. Kudu
  2. KUDU-2152

Tablet stuck under-replicated after some kind of tablet copy issue


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.8.0
    • Component/s: consensus
    • Labels: None

    Description

      I was stress testing with the following setup:

      • 8 servers (n1-standard-4 GCE boxes)
      • created a bunch of 100-tablet tables using loadgen until I had ~2500 replicas on each server
      • mounted another server using sshfs and put cmeta on that mount point (to make cmeta writes slower)
      • stress -c4 on all machines
      • shut down a server, wait for re-replication (green ksck), restart the server, rinse and repeat

      Eventually I got a stuck tablet. ksck reports:

      Tablet 271df8901d98442cb478593babd8a609 of table 'loadgen_auto_8e32cb07eb83458da4ec4d228bcb0f5a' is under-replicated: 1 replica(s) not RUNNING
        20d4d86f182043398594b67492d13fdc (kudu513-8.gce.cloudera.com:7050): RUNNING [LEADER]
        c2ea8f22f4034bcc97e26c9236811960 (kudu513-1.gce.cloudera.com:7050): bad state
          State:       STOPPED
          Data state:  TABLET_DATA_COPYING
          Last status: Deleted tablet blocks from disk
        cd0997b908ad41839f56a1b61210f2d4 (kudu513-3.gce.cloudera.com:7050): RUNNING
      
      1 replicas' active configs differ from the master's.
        All the peers reported by the master and tablet servers are:
        A = 20d4d86f182043398594b67492d13fdc
        D = 471027436ee8405ab7cdf8d22407696b
        B = c2ea8f22f4034bcc97e26c9236811960
        C = cd0997b908ad41839f56a1b61210f2d4
      
      The consensus matrix is:
       Config source |      Voters      | Current term | Config index | Committed?
      ---------------+------------------+--------------+--------------+------------
       master        | A*      B   C    |              |              | Yes
       A             | A*      B   C    | 11           | 29           | Yes
       B             |     D   B   C    | 9            | 23           | Yes
       C             | A*      B   C    | 11           | 29           | Yes
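
      To make the mismatch in the matrix explicit, here is a minimal sketch that encodes the four rows above and flags any replica whose voter set differs from the master's. The `ConfigRow` type and the `replicas_disagreeing_with_master` helper are hypothetical names used only for illustration; they are not part of ksck or Kudu.

```python
from typing import List, NamedTuple, Optional


class ConfigRow(NamedTuple):
    """One row of the consensus matrix (hypothetical illustration type)."""
    source: str
    voters: frozenset
    term: Optional[int]
    index: Optional[int]


# Values copied from the consensus matrix in this report.
ROWS = [
    ConfigRow("master", frozenset("ABC"), None, None),
    ConfigRow("A", frozenset("ABC"), 11, 29),
    ConfigRow("B", frozenset("DBC"), 9, 23),
    ConfigRow("C", frozenset("ABC"), 11, 29),
]


def replicas_disagreeing_with_master(rows: List[ConfigRow]) -> List[str]:
    """Return the sources whose voter set differs from the master's view."""
    master = next(r for r in rows if r.source == "master")
    return [
        r.source
        for r in rows
        if r.source != "master" and r.voters != master.voters
    ]


if __name__ == "__main__":
    print(replicas_disagreeing_with_master(ROWS))
```

      Only "B" is flagged: it still holds the older config (term 9, index 23) that lists "D" as a voter, while "A", "C" and the master agree on the config at index 29.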
      

      The leader ("A" above) keeps reporting that it is failing to send requests to "B" because it gets TABLET_NOT_RUNNING in response. As a result it never evicts "B": the leader treats TABLET_NOT_RUNNING as a temporary condition, assuming that it actually means BOOTSTRAPPING.

      The last lines in "B"'s log were:

      I0920 16:41:48.556422  3808 tablet_copy_client.cc:209] T 271df8901d98442cb478593babd8a609 P c2ea8f22f4034bcc97e26c9236811960: tablet copy: Beginning tablet copy session from remote peer at address kudu513-8.gce.cloudera.com:7050
      I0920 16:41:48.562335  3808 ts_tablet_manager.cc:1118] T 271df8901d98442cb478593babd8a609 P c2ea8f22f4034bcc97e26c9236811960: Deleting tablet data with delete state TABLET_DATA_COPYING
      W0920 16:41:48.578610  3808 env_util.cc:277] Failed to determine if path is a directory: /data0/ts-data/tablet-meta/271df8901d98442cb478593babd8a609.kudutmp.2Tu0Uy: Not found: /data0/ts-data/tablet-meta/271df8901d98442cb478593babd8a609.kudutmp.2Tu0Uy: No such file or directory (error 2)
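
      The reason the leader never gives up on "B" can be illustrated with a toy model. This is a sketch under stated assumptions, not Kudu's actual code: it assumes the leader evicts a follower only after a period without healthy contact, and that a response treated as transient resets that clock. `FollowerHealth`, `simulate` and `timeout_s` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Response(Enum):
    OK = auto()
    TABLET_NOT_RUNNING = auto()  # what "B" keeps returning in this report


@dataclass
class FollowerHealth:
    """Hypothetical per-follower bookkeeping kept by a leader."""
    last_healthy_contact_s: float = 0.0

    def record(self, now_s: float, response: Response,
               not_running_is_transient: bool) -> None:
        if response is Response.OK:
            self.last_healthy_contact_s = now_s
        elif (response is Response.TABLET_NOT_RUNNING
              and not_running_is_transient):
            # Assumed behavior: the error is read as "still bootstrapping",
            # so the follower is not counted as failed and the clock resets.
            self.last_healthy_contact_s = now_s

    def should_evict(self, now_s: float, timeout_s: float) -> bool:
        return now_s - self.last_healthy_contact_s > timeout_s


def simulate(not_running_is_transient: bool, seconds: int = 1000,
             timeout_s: float = 300.0) -> Optional[int]:
    """Return the second at which the follower is evicted, or None."""
    health = FollowerHealth()
    for now in range(1, seconds + 1):
        health.record(float(now), Response.TABLET_NOT_RUNNING,
                      not_running_is_transient)
        if health.should_evict(float(now), timeout_s):
            return now
    return None


if __name__ == "__main__":
    print(simulate(not_running_is_transient=True))
    print(simulate(not_running_is_transient=False))
```

      In this model, a replica that is permanently STOPPED but always answers TABLET_NOT_RUNNING is never evicted while the error is treated as transient. Once the error counts as a failure, the replica is evicted as soon as the timeout elapses.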
      

            People

              Assignee: Andrew Wong (awong)
              Reporter: Todd Lipcon (tlipcon)