Kudu / KUDU-3195

Make DMS flush policy more robust when maintenance threads are idle

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13.0
    • Fix Version/s: 1.14.0
    • Component/s: tserver
    • Labels:
      None

      Description

      In one scenario I observed very long bootstrap times of tablet servers (between 45 and 60 minutes) even though the tablet servers had a relatively small amount of data under management (~80GByte). It turned out the time was spent replaying WAL segments, with kudu cluster ksck reporting something like the below the whole time during bootstrap:

        b0a20b117a1242ae9fc15620a6f7a524 (tserver-6.local.site:7050): not running
          State:       BOOTSTRAPPING
          Data state:  TABLET_DATA_READY
          Last status: Bootstrap replaying log segment 21/37 (2.28M/7.85M this segment, stats: ops{read=27374 overwritten=0 applied=25016 ignored=657} inserts{seen=5949247 
      ignored=0} mutations{seen=0 ignored=0} orphaned_commits=7)
      

      The workload I ran before shutting down the tablet servers consisted of many small UPSERT operations, but the cluster was idle for a long time (a few hours or so) after the workload was terminated. The workload was generated by

      kudu perf loadgen \
        --table_name=$TABLE_NAME \
        --num_rows_per_thread=800000000 \
        --num_threads=4 \
        --use_upsert \
        --use_random_pk \
        $MASTER_ADDR
      

      The table that the UPSERT workload was running against had been pre-populated by the following:

      kudu perf loadgen \
        --table_num_replicas=3 \
        --keep-auto-table \
        --table_num_hash_partitions=5 \
        --table_num_range_partitions=5 \
        --num_rows_per_thread=800000000 \
        --num_threads=4 \
        $MASTER_ADDR
      

      As it turned out, the tablet servers had accumulated a huge number of DMSes that required flushing/compaction, but after the memory pressure subsided, the flushing/compaction policy was scheduling just one operation per tablet every 120 seconds (the interval is controlled by --flush_threshold_secs). In fact, the tablet servers could have flushed those DMSes non-stop, since the maintenance threads were otherwise completely idle and there was no active workload running against the cluster. Those DMSes had been around for a long time (much longer than 120 seconds) and were anchoring a lot of WAL segments, so all the corresponding operations had to be replayed from the WAL once I restarted the tablet servers.

      It would be great to update the flushing/compaction policy to allow tablet servers to run FlushDeltaMemStoresOp as soon as a DMS becomes older than the interval specified by --flush_threshold_secs, whenever the maintenance threads are not otherwise busy.

        Attachments

          Activity

            People

            • Assignee: Unassigned
            • Reporter: aserbin Alexey Serbin
            • Votes: 0
            • Watchers: 3

              Dates

              • Created:
              • Updated:
              • Resolved: