Kudu / KUDU-2206

Create table times out because too many DRSs in one tablet cause lock contention

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.6.0
    • Component/s: None
    • Labels: None

    Description

      We encountered an RPC timeout exception when using Spark SQL, which uses the Kudu Java client internally, to create a table on a Kudu cluster. The cluster has 10 tservers and 1 master on 10 machines; the target table has 10 range partitions and 5 hash partitions.
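
      To get a feel for how much work this create-table request fans out to, the partition layout above can be turned into a rough tablet count. This is only a sketch: it assumes the 5 hash partitions are 5 hash buckets that combine multiplicatively with the 10 range partitions, that replicas are spread evenly, and that the replication factor is 3 (an assumption; the report does not state it).

```python
# Rough tablet-count arithmetic for the table described above.
range_partitions = 10
hash_buckets = 5
tablet_servers = 10
replication_factor = 3  # ASSUMPTION: not stated in the report

# Hash and range partitioning multiply: one tablet per (range, bucket) pair.
tablets = range_partitions * hash_buckets
replicas = tablets * replication_factor
# ASSUMPTION: replicas are spread evenly across the tservers.
replicas_per_tserver = replicas / tablet_servers

print(f"tablets={tablets} replicas={replicas} per_tserver={replicas_per_tserver:.1f}")
```

      Under these assumptions, each tserver has to bring up roughly 15 new tablet replicas for this one request.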

      From the web UI, I found that it took about 3 minutes before all the tablets elected a leader, and I could see many Delete Tablet records in the UI, like:
      Delete Tablet Running 2.13 min 719f0f496bc34a469e4069b2861b4be8 Delete Tablet RPC for TS=044f1da9a27c46acb82b1386f829f4dc

      I also found many retry records in the tserver logs, like:
      W1031 23:04:40.088256 5816 consensus_peers.cc:357] T fcde65c4e4cf4df29b9ef9884ce292b2 P 0f53a0d3ef7e44ebb0365c800752d5bd -> Peer 23f962e4a1744381ad5fa0d2d8b10241 (c3-kudu-tst-st07.bj:18700): Couldn't send request to peer 23f962e4a1744381ad5fa0d2d8b10241 for tablet fcde65c4e4cf4df29b9ef9884ce292b2. Error code: TABLET_NOT_RUNNING (12). Status: Illegal state: Tablet not RUNNING: NOT_STARTED. Retrying in the next heartbeat period. Already tried 94 times.
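
      When sifting through the attached tserver logs for warnings like the one above, it can help to extract the tablet ID, the remote peer, the error code, and the retry count from each line. A minimal sketch follows; the function name and the regular expression are hypothetical and written against this single sample line, so other Kudu versions or messages may need adjustments.

```python
import re

# Pattern for the consensus_peers.cc warning quoted above.
# ASSUMPTION: derived from one sample line only.
RETRY_RE = re.compile(
    r"T (?P<tablet>[0-9a-f]+) P (?P<local_peer>[0-9a-f]+) -> "
    r"Peer (?P<remote_peer>[0-9a-f]+) \((?P<host>[^)]+)\).*?"
    r"Error code: (?P<error>\w+) \((?P<code>\d+)\).*?"
    r"Already tried (?P<tries>\d+) times"
)

def parse_retry_line(line):
    """Return a dict of fields from a retry warning, or None if it does not match."""
    m = RETRY_RE.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["code"] = int(rec["code"])
    rec["tries"] = int(rec["tries"])
    return rec

SAMPLE = (
    "W1031 23:04:40.088256 5816 consensus_peers.cc:357] "
    "T fcde65c4e4cf4df29b9ef9884ce292b2 P 0f53a0d3ef7e44ebb0365c800752d5bd -> "
    "Peer 23f962e4a1744381ad5fa0d2d8b10241 (c3-kudu-tst-st07.bj:18700): "
    "Couldn't send request to peer 23f962e4a1744381ad5fa0d2d8b10241 for tablet "
    "fcde65c4e4cf4df29b9ef9884ce292b2. Error code: TABLET_NOT_RUNNING (12). "
    "Status: Illegal state: Tablet not RUNNING: NOT_STARTED. "
    "Retrying in the next heartbeat period. Already tried 94 times."
)

rec = parse_retry_line(SAMPLE)
print(rec["tablet"], rec["host"], rec["error"], rec["tries"])
```

      Grouping the parsed records by tablet or by remote peer shows which replicas stayed in NOT_STARTED the longest.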

      The master and tserver logs, starting from when the master received the create table request, are in the attachments.

      The Kudu version is 1.3.0; the nearest commit is 00813f96b9cb0c9ec57a17e5c85242f7679db0e0

      The exception that the client received looks like:
      Error: org.apache.kudu.client.NonRecoverableException: RPC can not complete before timeout: KuduRpc(method=IsCreateTableDone, tablet=null, attempt=25, DeadlineTracker(timeout=30000, elapsed=28499), Traces: [0ms] sending RPC to server , [0ms] received from server response OK, [20ms] sending RPC to server , [20ms] received from server response OK, [40ms] sending RPC to server , [40ms] received from server response OK, [59ms] sending RPC to server , [60ms] received from server response OK, [80ms] sending RPC to server , [80ms] received from server response OK, [100ms] sending RPC to server , [100ms] received from server response OK, [140ms] sending RPC to server , [141ms] received from server response OK, [200ms] sending RPC to server , [200ms] received from server response OK, [319ms] sending RPC to server , [320ms] received from server response OK, [780ms] sending RPC to server , [780ms] received from server response OK, [2740ms] sending RPC to server , [2741ms] received from server response OK, [3580ms] sending RPC to server , [3580ms] received from server response OK, [4840ms] sending RPC to server , [4840ms] received from server response OK, [7080ms] sending RPC to server , [7081ms] received from server response OK, [8320ms] sending RPC to server , [8321ms] received from server response OK, [11620ms] sending RPC to server , [11621ms] received from server response OK, [13540ms] sending RPC to server , [13540ms] received from server response OK, [16819ms] sending RPC to server , [16820ms] received from server response OK, [19020ms] sending RPC to server , [19020ms] received from server response OK, [21340ms] sending RPC to server , [21341ms] received from server response OK, [24660ms] sending RPC to server , [24661ms] received from server response OK, [26800ms] sending RPC to server , [26800ms] received from server response OK, [27660ms] sending RPC to server , [27660ms] received from server response OK, [28480ms] sending RPC to server , [28481ms] received from server
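
      The trace in this exception records each IsCreateTableDone poll, with the spacing between attempts widening from about 20 ms to several seconds before the 30000 ms deadline is nearly used up. A minimal sketch for pulling the send timestamps and the gaps between them out of such a trace string follows; the helper names are hypothetical and the regular expression is written against the format seen in this one message.

```python
import re

# Matches the "[<N>ms] sending RPC to server" entries in the trace.
# ASSUMPTION: format taken from the single exception message above.
SEND_RE = re.compile(r"\[(\d+)ms\] sending RPC to server")

def send_times_ms(trace):
    """Return the elapsed time (ms) at which each poll was sent."""
    return [int(ms) for ms in SEND_RE.findall(trace)]

def gaps_ms(times):
    """Return the delay between consecutive polls."""
    return [later - earlier for earlier, later in zip(times, times[1:])]

# Shortened excerpt of the trace above (first four polls only).
TRACE = (
    "[0ms] sending RPC to server , [0ms] received from server response OK, "
    "[20ms] sending RPC to server , [20ms] received from server response OK, "
    "[40ms] sending RPC to server , [40ms] received from server response OK, "
    "[59ms] sending RPC to server , [60ms] received from server response OK"
)

times = send_times_ms(TRACE)
print(times)           # [0, 20, 40, 59]
print(gaps_ms(times))  # [20, 20, 19]
```

      Running the same helpers over the full trace gives the complete polling schedule, which can be compared with the roughly 3 minutes the tablets needed to elect leaders.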

      Attachments

        1. trace_tserver07_trace.json (280 kB, ZhangZhen)
        2. tserver07.flags (9 kB, ZhangZhen)
        3. pstack.zip (151 kB, ZhangZhen)
        4. tserver_07_23f962e4a1.log (1.52 MB, ZhangZhen)
        5. tsever_02_0a8bbcbb.log (1.09 MB, ZhangZhen)
        6. kudu_master.log (1.41 MB, ZhangZhen)
        7. tserver_01_0f53a0d3.log (1006 kB, ZhangZhen)

          People

            Assignee: Todd Lipcon (tlipcon)
            Reporter: ZhangZhen (zhquake@gmail.com)
            Votes: 0
            Watchers: 3
