KUDU-3565

Avoid scheduling multiple IO-heavy maintenance operations at once for the same tablet


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.0.0, 1.0.1, 1.1.0, 1.2.0, 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1, 1.13.0, 1.14.0, 1.15.0, 1.16.0, 1.17.0
    • Fix Version/s: None
    • Component/s: master, tserver
    • Labels: None

    Description

      The current implementation of scheduling maintenance operations inadvertently induces much higher disk IO demand than necessary. When there is more than one maintenance worker thread, the maintenance manager thread can schedule multiple IO-heavy operations on the same tablet to run concurrently. This can create a bottleneck by saturating disk IO throughput, since the maintenance worker threads working on the same tablet access data backed by the same set of drives.

      Such behavior doesn't scale well as the number of maintenance worker threads increases, especially when data directories are backed by spinning drives (HDD).

      With four maintenance threads, I saw log messages like those below. Notice that when four operations are scheduled to run at almost the same time, each takes more than 3 seconds to complete, whereas other instances of similar flush operations on the same tablet take less than 100ms when no other flush operation on that tablet is running. Notice also that not only is fdatasync_us higher for the concurrently running operations, but some of them are now waiting on synchronization primitives, blocking the maintenance threads from doing other useful work, e.g. compacting/flushing data of other active tablets.

      In the case of MRS/DMS flushes, the flusher acquires the corresponding synchronization primitives as a writer and therefore blocks concurrently running insert/update/delete operations, since those acquire the same synchronization primitives as readers when checking for a primary key's presence. So, long-running flush operations induce extra latency into concurrently running write workloads.
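      The following is a minimal, self-contained C++ sketch of that contention pattern (illustrative only, not Kudu code; the lock name and timings are made up): a flush-like task holds a reader-writer lock as a writer while doing slow IO, so write-path operations that need the same lock as readers stall for the whole duration of the flush.

      // Illustrative sketch: a long write-side critical section (the "flush")
      // stalls readers (the write-path primary-key checks) on the same lock.
      #include <chrono>
      #include <iostream>
      #include <shared_mutex>
      #include <thread>

      using namespace std::chrono_literals;

      std::shared_mutex component_lock;  // stands in for the per-component lock

      void SlowFlush() {
        std::unique_lock<std::shared_mutex> guard(component_lock);  // writer side
        std::this_thread::sleep_for(3s);  // pretend fdatasync() on a saturated disk
      }

      void WriteOp(int id) {
        const auto start = std::chrono::steady_clock::now();
        std::shared_lock<std::shared_mutex> guard(component_lock);  // reader side
        const auto waited = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start);
        std::cout << "write op " << id << " waited " << waited.count() << " ms\n";
      }

      int main() {
        std::thread flusher(SlowFlush);
        std::this_thread::sleep_for(100ms);  // let the flusher grab the lock first
        std::thread w1(WriteOp, 1);
        std::thread w2(WriteOp, 2);
        flusher.join();
        w1.join();
        w2.join();
        return 0;
      }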

      It's necessary to improve the scheduling algorithm for maintenance operations, differentiating operations not only by their score, but also taking into account their IO impact and the dependencies between them. As the simplest approach, it's necessary to add an option to enable the following behavior:

      • an IO-heavy operation (such as a flush, merge compaction, etc.) should not be scheduled when another one is already scheduled to run or is running on the same tablet; a minimal sketch of this rule follows the log snippet below
      I0328 14:36:18.732758 1279870 maintenance_manager.cc:382] P db78de4070874e75be1482b02b5161e6: Scheduling FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035118
      I0328 14:36:18.816906 1278984 maintenance_manager.cc:603] P db78de4070874e75be1482b02b5161e6: FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 0.084s     user 0.017s     sys 0.005s Metrics: {"bytes_written":33228,"delete_count":0,"fdatasync":3,"fdatasync_us":56207,"lbm_write_time_us":115,"lbm_writes_lt_1ms":6,"reinsert_count":0,"update_count":7500}
      ...
      I0328 14:36:18.848975 1279870 maintenance_manager.cc:382] P db78de4070874e75be1482b02b5161e6: Scheduling FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035150
      I0328 14:36:18.962837 1279870 maintenance_manager.cc:382] P db78de4070874e75be1482b02b5161e6: Scheduling FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035182
      I0328 14:36:19.073328 1279870 maintenance_manager.cc:382] P db78de4070874e75be1482b02b5161e6: Scheduling FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035214
      I0328 14:36:19.182384 1279870 maintenance_manager.cc:382] P db78de4070874e75be1482b02b5161e6: Scheduling FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035244
      ...
      I0328 14:36:22.128937 1278984 maintenance_manager.cc:603] P db78de4070874e75be1482b02b5161e6: FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 3.280s     user 0.021s     sys 0.005s Metrics: {"bytes_written":35275,"delete_count":0,"fdatasync":3,"fdatasync_us":167953,"lbm_write_time_us":124,"lbm_writes_lt_1ms":7,"reinsert_count":0,"update_count":7896}
      I0328 14:36:22.241256 1278985 maintenance_manager.cc:603] P db78de4070874e75be1482b02b5161e6: FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 3.278s     user 0.021s     sys 0.000s Metrics: {"bytes_written":28910,"delete_count":0,"fdatasync":3,"fdatasync_us":99705,"lbm_write_time_us":123,"lbm_writes_lt_1ms":6,"mutex_wait_us":3081560,"reinsert_count":0,"update_count":6381}
      ...
      I0328 14:36:22.292019 1278986 maintenance_manager.cc:603] P db78de4070874e75be1482b02b5161e6: FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 3.219s     user 0.018s     sys 0.003s Metrics: {"bytes_written":30826,"delete_count":0,"fdatasync":3,"fdatasync_us":84332,"lbm_write_time_us":65,"lbm_writes_lt_1ms":6,"mutex_wait_us":3106656,"reinsert_count":0,"update_count":6915}
      I0328 14:36:22.341326 1278983 maintenance_manager.cc:603] P db78de4070874e75be1482b02b5161e6: FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 3.159s     user 0.015s     sys 0.002s Metrics: {"bytes_written":34786,"delete_count":0,"fdatasync":3,"fdatasync_us":326326,"lbm_write_time_us":109,"lbm_writes_lt_1ms":7,"mutex_wait_us":2807569,"reinsert_count":0,"update_count":7866}
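      As a rough illustration of the proposed option, the sketch below (hypothetical names; not the actual MaintenanceManager code) skips an IO-heavy candidate whenever another IO-heavy operation is already scheduled or running on the same tablet, and otherwise falls back to picking the highest-scoring candidate.

      #include <string>
      #include <unordered_set>
      #include <vector>

      struct OpCandidate {
        std::string tablet_id;
        bool io_heavy;      // e.g. flush or merge compaction
        double perf_score;  // higher means more beneficial to run
      };

      class PerTabletIoAwareScheduler {
       public:
        explicit PerTabletIoAwareScheduler(bool serialize_io_heavy_per_tablet)
            : serialize_io_heavy_per_tablet_(serialize_io_heavy_per_tablet) {}

        // Pick the best-scoring runnable candidate, honoring the per-tablet constraint.
        const OpCandidate* PickBest(const std::vector<OpCandidate>& candidates) const {
          const OpCandidate* best = nullptr;
          for (const auto& op : candidates) {
            if (serialize_io_heavy_per_tablet_ && op.io_heavy &&
                tablets_with_io_heavy_op_.count(op.tablet_id) > 0) {
              // Another IO-heavy op is already scheduled/running on this tablet: skip.
              continue;
            }
            if (best == nullptr || op.perf_score > best->perf_score) {
              best = &op;
            }
          }
          return best;
        }

        // Bookkeeping hooks invoked when an operation is launched and when it completes.
        void MarkLaunched(const OpCandidate& op) {
          if (op.io_heavy) tablets_with_io_heavy_op_.insert(op.tablet_id);
        }
        void MarkCompleted(const OpCandidate& op) {
          if (op.io_heavy) tablets_with_io_heavy_op_.erase(op.tablet_id);
        }

       private:
        const bool serialize_io_heavy_per_tablet_;
        std::unordered_set<std::string> tablets_with_io_heavy_op_;
      };

      With the option disabled, the selection degenerates to the current behavior of picking purely by perf score, so the change would be backward-compatible.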
      


          People

            Assignee: Unassigned
            Reporter: Alexey Serbin (aserbin)
            Votes: 0
            Watchers: 1
