Uploaded image for project: 'Kudu'
  1. Kudu
  2. KUDU-2453

kudu should stop creating tablet infinitely

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Reopened
    • Major
    • Resolution: Unresolved
    • 1.4.0, 1.7.2
    • None
    • master, tserver
    • None

    Description

      I have met this problem again on 2018/10/26. And now the kudu version is 1.7.2.

      -----------------------------------------------------

      We modified the flag 'max_create_tablets_per_ts' (2000) of master.conf, and there are some load on the kudu cluster. Then someone else created a big table which had tens of thousands of tablets from impala-shell (that was a mistake). 

      CREATE TABLE XXX(
      ...
         PRIMARY KEY (...)
      )
      PARTITION BY HASH (...) PARTITIONS 100,
      RANGE (...)
      (
        PARTITION "2018-10-24" <= VALUES < "2018-10-24\000",
        PARTITION "2018-10-25" <= VALUES < "2018-10-25\000",
        ...
        PARTITION "2018-12-07" <= VALUES < "2018-12-07\000"
      )
      STORED AS KUDU
      TBLPROPERTIES ('kudu.master_addresses'= '...');
      

      Here are the logs after creating table (only pick one tablet as example):

      ------------------------------------------------------Kudu-master log
      ======e884bda6bbd3482f94c07ca0f34f99a4======
      W1024 11:40:51.914397 180146 catalog_manager.cc:2664] TS 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): Create Tablet RPC failed for tablet e884bda6bbd3482f94c07ca0f34f99a4: Remote error: Service unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService from 10.120.219.118:50247 dropped due to backpressure. The service queue is full; it has 512 items.
      I1024 11:40:51.914412 180146 catalog_manager.cc:2700] Scheduling retry of CreateTablet RPC for tablet e884bda6bbd3482f94c07ca0f34f99a4 on TS 39f15fcf42ef45bba0c95a3223dc25ee with a delay of 42 ms (attempt = 1)
      ...
      
      ======Be replaced by 0b144c00f35d48cca4d4981698faef72======
      W1024 11:41:22.114512 180202 catalog_manager.cc:3949] T 00000000000000000000000000000000 P f6c9a09da7ef4fc191cab6276b942ba3: Tablet e884bda6bbd3482f94c07ca0f34f99a4 (table quasi_realtime_user_feature [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed timeout. Replacing with a new tablet 0b144c00f35d48cca4d4981698faef72
      ...
      I1024 11:41:22.391916 180202 catalog_manager.cc:3806] T 00000000000000000000000000000000 P f6c9a09da7ef4fc191cab6276b942ba3: Sending DeleteTablet for 3 replicas of tablet e884bda6bbd3482f94c07ca0f34f99a4
      ...
      I1024 11:41:22.391927 180202 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_DELETED) for tablet e884bda6bbd3482f94c07ca0f34f99a4 on 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050) (Replaced by 0b144c00f35d48cca4d4981698faef72 at 2018-10-24 11:41:22 CST)
      ...
      W1024 11:41:22.428129 180146 catalog_manager.cc:2892] TS 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): delete failed for tablet e884bda6bbd3482f94c07ca0f34f99a4 with error code TABLET_NOT_RUNNING: Already present: State transition of tablet e884bda6bbd3482f94c07ca0f34f99a4 already in progress: creating tablet
      ...
      I1024 11:41:22.428143 180146 catalog_manager.cc:2700] Scheduling retry of e884bda6bbd3482f94c07ca0f34f99a4 Delete Tablet RPC for TS=39f15fcf42ef45bba0c95a3223dc25ee with a delay of 35 ms (attempt = 1)
      ...
      W1024 11:41:22.683702 180145 catalog_manager.cc:2664] TS b251540e606b4863bb576091ff961892 (kudu1.lt.163.org:7050): Create Tablet RPC failed for tablet 0b144c00f35d48cca4d4981698faef72: Remote error: Service unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService from 10.120.219.118:59735 dropped due to backpressure. The service queue is full; it has 512 items.
      I1024 11:41:22.683717 180145 catalog_manager.cc:2700] Scheduling retry of CreateTablet RPC for tablet 0b144c00f35d48cca4d4981698faef72 on TS b251540e606b4863bb576091ff961892 with a delay of 46 ms (attempt = 1)
      ...
      
      ======Be replaced by c0e0acc448fc42fc9e48f5025b112a75======
      W1024 11:41:52.775420 180202 catalog_manager.cc:3949] T 00000000000000000000000000000000 P f6c9a09da7ef4fc191cab6276b942ba3: Tablet 0b144c00f35d48cca4d4981698faef72 (table quasi_realtime_user_feature [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed timeout. Replacing with a new tablet c0e0acc448fc42fc9e48f5025b112a75
      ...
      
      
      ------------------------------------------------------Kudu-tserver log
      I1024 11:40:52.014571 137358 tablet_service.cc:758] Processing CreateTablet for tablet e884bda6bbd3482f94c07ca0f34f99a4 (table=quasi_realtime_user_feature [id=946d6dd03ec544eab96231e5a03bed59]), partition=HASH (user_id) PARTITION 29, RANGE (dt) PARTITION "2018-11-10" <= VALUES < "2018-11-10\000"
      ...
      I1024 11:40:52.017539 137358 ts_tablet_manager.cc:1080] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Registered tablet (data state: TABLET_DATA_READY)
      ...
      I1024 11:41:22.392292 137355 tablet_service.cc:799] Processing DeleteTablet for tablet e884bda6bbd3482f94c07ca0f34f99a4 with delete_type TABLET_DATA_DELETED (Replaced by 0b144c00f35d48cca4d4981698faef72 at 2018-10-24 11:41:22 CST) from {username='kudu'} at 10.120.219.118:50247
      ...
      I1024 12:03:31.079942 126376 ts_tablet_manager.cc:938] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Bootstrapping tablet
      I1024 12:03:31.079965 126376 tablet_bootstrap.cc:436] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Bootstrap starting.
      ...
      I1024 12:03:31.080237 126376 tablet_bootstrap.cc:581] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: No blocks or log segments found. Creating new log.
      ...
      I1024 12:03:31.080834 126376 tablet_bootstrap.cc:436] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: No bootstrap required, opened a new log
      I1024 12:03:31.080870 126376 ts_tablet_manager.cc:955] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Time spent bootstrapping tablet: real 0.001s user 0.004s sys 0.000s
      I1024 12:03:31.080987 137368 tablet_service.cc:799] Processing DeleteTablet for tablet 0ba3a2963c524d859a85ad5fbce5bf96 with delete_type TABLET_DATA_DELETED (Replaced by 539e0ab40ec1442a94b330c18521d27f at 2018-10-24 11:42:23 CST) from {username='kudu'} at 10.120.219.118:50247
      I1024 12:03:31.080996 126376 raft_consensus.cc:304] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Replica starting. Triggering 0 pending transactions. Active config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "39f15fcf42ef45bba0c95a3223dc25ee" member_type: VOTER last_known_addr { host: "kudu2.lt.163.org" port: 7050 } } peers { permanent_uuid: "cd75ddaabaa8487b9fe1feab9d2cba83" member_type: VOTER last_known_addr { host: "kudu3.lt.163.org" port: 7050 } } peers { permanent_uuid: "061221dd1c3e40a3a3338afc74bb66f5" member_type: VOTER last_known_addr { host: "kudu5.lt.163.org" port: 7050 } }
      I1024 12:03:31.081007 126376 raft_consensus.cc:330] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Consensus starting up: Expiring failure detector timer to make a prompt election more likely
      I1024 12:03:31.081014 126376 raft_consensus.cc:605] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Becoming Follower/Learner. State: Replica: 39f15fcf42ef45bba0c95a3223dc25ee, State: Initialized, Role: FOLLOWER
      I1024 12:03:31.081053 126376 consensus_queue.cc:226] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [NON_LEADER]: Queue going to NON_LEADER mode. State: All replicated index: 0, Majority replicated index: 0, Committed index: 0, Last appended: 0.0, Last appended by leader: 0, Current term: 0, Majority size: -1, State: 0, Mode: NON_LEADER, active raft config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "39f15fcf42ef45bba0c95a3223dc25ee" member_type: VOTER last_known_addr { host: "kudu2.lt.163.org" port: 7050 } } peers { permanent_uuid: "cd75ddaabaa8487b9fe1feab9d2cba83" member_type: VOTER last_known_addr { host: "kudu3.lt.163.org" port: 7050 } } peers { permanent_uuid: "061221dd1c3e40a3a3338afc74bb66f5" member_type: VOTER last_known_addr { host: "kudu5.lt.163.org" port: 7050 } }
      ...
      I1024 12:03:35.315024 161474 raft_consensus.cc:436] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Starting pre-election (no leader contacted us within the election timeout)
      I1024 12:03:35.315050 161474 raft_consensus.cc:2700] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Snoozing failure detection for 2.642s (starting election)
      I1024 12:03:35.315111 161474 raft_consensus.cc:458] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Starting pre-election with config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "39f15fcf42ef45bba0c95a3223dc25ee" member_type: VOTER last_known_addr { host: "kudu2.lt.163.org" port: 7050 } } peers { permanent_uuid: "cd75ddaabaa8487b9fe1feab9d2cba83" member_type: VOTER last_known_addr { host: "kudu3.lt.163.org" port: 7050 } } peers { permanent_uuid: "061221dd1c3e40a3a3338afc74bb66f5" member_type: VOTER last_known_addr { host: "kudu5.lt.163.org" port: 7050 } }
      I1024 12:03:35.315141 137342 tablet_service.cc:799] Processing DeleteTablet for tablet c3f47046905e4c0f9ffe21a9e7a5bc8b with delete_type TABLET_DATA_DELETED (Replaced by 28f11e17ecdc45d2a4d72c74afc2a488 at 2018-10-24 11:45:27 CST) from {username='kudu'} at 10.120.219.118:50247
      I1024 12:03:35.315763 161474 leader_election.cc:231] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Requesting pre-vote from peer cd75ddaabaa8487b9fe1feab9d2cba83
      I1024 12:03:35.315790 161474 leader_election.cc:231] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Requesting pre-vote from peer 061221dd1c3e40a3a3338afc74bb66f5
      W1024 12:03:35.315987 137266 leader_election.cc:293] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Tablet error from VoteRequest() call to peer cd75ddaabaa8487b9fe1feab9d2cba83: Illegal state: must be running to vote when last-logged opid is not known
      W1024 12:03:35.316056 137267 leader_election.cc:293] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Tablet error from VoteRequest() call to peer 061221dd1c3e40a3a3338afc74bb66f5: Illegal state: must be running to vote when last-logged opid is not known
      I1024 12:03:35.316074 137267 leader_election.cc:258] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Election decided. Result: candidate lost.
      I1024 12:03:35.316102 161474 raft_consensus.cc:2700] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Snoozing failure detection for 4.310s (election complete)
      I1024 12:03:35.316115 161474 raft_consensus.cc:2455] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Leader pre-election lost for term 1. Reason: could not achieve majority
      ...
      I1024 12:04:01.303092 126376 ts_tablet_manager.cc:983] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Time spent starting tablet: real 30.222s user 0.000s sys 0.000s
      W1024 12:04:01.303108 126376 ts_tablet_manager.cc:1006] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Tablet startup took 30222ms
      ...
      W1024 12:04:01.303225 126376 ts_tablet_manager.cc:1008] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Trace:
      1024 11:40:51.956652 (+ 0us) service_pool.cc:163] Inserting onto call queue
      1024 11:40:52.014539 (+ 57887us) service_pool.cc:222] Handling call
      1024 11:40:52.014585 (+ 46us) ts_tablet_manager.cc:357] Acquired tablet manager lock
      1024 11:40:52.014588 (+ 3us) ts_tablet_manager.cc:370] Creating new metadata...
      1024 11:40:52.016532 (+ 1944us) tablet_metadata.cc:556] Metadata flushed
      1024 11:40:52.017445 (+ 913us) tablet_replica.cc:143] Creating consensus instance
      1024 11:40:52.017601 (+ 156us) inbound_call.cc:157] Queueing success response
      1024 12:03:31.079950 (+1359062349us) ts_tablet_manager.cc:939] Bootstrapping tablet
      1024 12:03:31.080879 (+ 929us) ts_tablet_manager.cc:984] Starting tablet replica
      1024 12:03:31.080893 (+ 14us) tablet_replica.cc:195] Starting instrumentation
      1024 12:03:31.080900 (+ 7us) tablet_replica.cc:207] Starting consensus
      Metrics: {"fdatasync":1,"fdatasync_us":238,"mutex_wait_us":30221902,"raft.queue_time_us":6,"raft.run_cpu_time_us":4,"raft.run_wall_time_us":5,"tablet-open.queue_time_us":1359062354}
      

      Tablet startup took 30+ seconds.

      It was a long time for him to wait, so he did "ctrl+c". But we found that the tablets in 'INITIALIZED' status was growing rapidly, half an hour later it was 350,000

      100(hash) * 45(range) * 3(RF) * (60(minute) * 60(second) / 30(repeat/second)) / 5(tservers) = 324000 (tablets/tserver).

      (created tablets: 60m * 60s / 30+s * 12(threads) = 1440 (tablets per hour))

      We deleted this table by kudu client tool, and found that the number of 'INITIALIZED' tablets was going down slowly. By simple estimating it will take 10+ days to be back to normal.  But luckily, the application system are not affected.

      ------------------------------------------------------Kudu-master log with another tablet uuid
      I1031 16:21:21.644222 180146 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_DELETED) for tablet d1fd56be8eef44e782d509a0eeae9c15 on 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050) (Replaced by ff4fd0a538944d69b8a6beea81e5bb01 at 2018-10-24 12:39:17 CST)
      W1031 16:21:21.644421 180146 catalog_manager.cc:2892] TS 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): delete failed for tablet d1fd56be8eef44e782d509a0eeae9c15 with error code TABLET_NOT_RUNNING: Already present: State transition of tablet d1fd56be8eef44e782d509a0eeae9c15 already in progress: creating tablet
      I1031 16:21:21.644436 180146 catalog_manager.cc:2700] Scheduling retry of d1fd56be8eef44e782d509a0eeae9c15 Delete Tablet RPC for TS=39f15fcf42ef45bba0c95a3223dc25ee with a delay of 553 ms (attempt = 6)
      
      ------------------------------------------------------Kudu-tserver log
      I1031 16:21:22.197888 137341 tablet_service.cc:799] Processing DeleteTablet for tablet d1fd56be8eef44e782d509a0eeae9c15 with delete_type TABLET_DATA_DELETED (Replaced by ff4fd0a538944d69b8a6beea81e5bb01 at 2018-10-24 12:39:17 CST) from {username='kudu'} at 10.120.219.118:50247
      I1031 16:21:22.230309 137131 maintenance_manager.cc:492] P 39f15fcf42ef45bba0c95a3223dc25ee: FlushDeltaMemStoresOp(70499bc0f9ac4d8196ae5a0be6ef0b8b) complete. Timing: real 0.416s	user 0.404s	sys 0.008s Metrics: {"fdatasync":3,"fdatasync_us":2583,"lbm_write_time_us":29,"lbm_writes_lt_1ms":4}
      I1031 16:21:22.321700 137341 tablet_service.cc:799] Processing DeleteTablet for tablet 74a30181dea9400a9bcfaeb56f83f379 with delete_type TABLET_DATA_DELETED (Replaced by 31e350fddea443048946f5a20d3171bd at 2018-10-31 16:21:13 CST) from {username='kudu'} at 10.120.219.118:50247
      I1031 16:21:22.350440 137341 tablet_service.cc:799] Processing DeleteTablet for tablet 7c864af01309432c9a2a4d1c88bbe52b with delete_type TABLET_DATA_DELETED (Replaced by ec4b733818d940e0af32c51bda3c7^C
       

       

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              helifu LiFu He
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated: