  1. Kudu
  2. KUDU-3447

Limit the usage of network bandwidth of tablet copying


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved

    Description

      Copying tablets from an old cluster to a new cluster with the command kudu local_replica copy_from_remote is a resource-intensive operation. As the following picture shows, memory usage is as high as 75%, the network is almost fully occupied (the overall network bandwidth is 2Gb/s), and disk reads are very high (the overall disk bandwidth is 200MB/s).

      If the data size is very large, the copying process will last for a long time, and other services may be impacted and become unavailable. Therefore it is better to limit the tablet copying speed to keep the system stable. The goal is to balance the tablet copying speed against the impact on other services.

      Since copy_from_remote mainly downloads data from the remote cluster and writes it to the local file system, limiting the download speed is a direct way to control resource consumption. Several algorithms can be used to implement a rate limiter. This patch will use the token bucket algorithm as implemented in the Facebook Folly library: https://github.com/facebook/folly/blob/main/folly/TokenBucket.h
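To illustrate the idea behind a token bucket, here is a minimal Python sketch. It is not the patch itself and does not reproduce the Folly TokenBucket API: the names TokenBucket, consume and throttled_copy are hypothetical, and the variant shown lets the balance go negative (borrowing) so that the caller is told how long to pause. One token stands for one byte.

```python
import time
from typing import Callable, Iterable


class TokenBucket:
    """Token bucket with borrowing; one token stands for one byte."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if rate <= 0 or burst <= 0:
            raise ValueError("rate and burst must be positive")
        self.rate = rate      # refill speed, bytes per second
        self.burst = burst    # bucket capacity, bytes
        self._clock = clock
        self._tokens = burst  # start with a full bucket
        self._last = clock()

    def consume(self, nbytes: int) -> float:
        """Deduct nbytes (possibly going into debt) and return how many
        seconds the caller should pause before its next transfer."""
        now = self._clock()
        # Refill for the elapsed time, never beyond the bucket capacity.
        self._tokens = min(self.burst,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        self._tokens -= nbytes
        # A negative balance is debt that must be repaid by waiting.
        return max(0.0, -self._tokens / self.rate)


def throttled_copy(chunks: Iterable[bytes], bucket: TokenBucket,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Consume chunks (standing in for downloaded blocks) at the bucket's rate."""
    total = 0
    for chunk in chunks:
        delay = bucket.consume(len(chunk))
        if delay > 0:
            sleep(delay)
        total += len(chunk)
    return total
```

The clock and sleep functions are injectable only so the behaviour can be checked without real waiting. This sketch is not thread-safe; if several download threads shared one limiter, it would presumably need a lock around consume.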

       

      Performance Tests

      1. Data size:

      TABLE test_1
      on disk size: 13263880213
      live row count: 66433035

      2. Test Case:

      case 1:

      kudu local_replica copy_from_remote xxx_tablet_ids src_tserver_addr:7050 -fs_data_dirs=/test/data_dir -fs_wal_dir=/test/wal_dir -tablet_copy_download_threads_nums_per_session=4 -num_threads=4

      case 2:

      kudu local_replica copy_from_remote xxx_tablet_ids src_tserver_addr:7050 -fs_data_dirs=/test/data_dir -fs_wal_dir=/test/wal_dir -tablet_copy_download_threads_nums_per_session=4 -num_threads=4 -enable_network_speed_limit=true -limit_network_speed=25

      3. Results:

      3.1 The usage of CPU

      Left is case 1, right is case 2. As we can see, the speed limit feature reduces CPU consumption.

      3.2 Load of CPU

      Left is case 1, right is case 2. As we can see, the speed limit feature reduces CPU load.

      3.3 Network bandwidth

      Left is case 1, right is case 2. As we can see, the speed limit feature holds network usage to approximately 25MB/s.
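To make the trade-off between copying speed and impact on other services concrete, the following Python sketch estimates how long the test table would take to transfer at a given rate. It rests on assumptions not stated in this issue: that the reported on disk size is in bytes and roughly equals the number of bytes transferred, that MB means 10^6 bytes, and that the limit is sustained for the whole copy. The function name is hypothetical, and the result is a rough lower bound, not a measurement.

```python
def estimated_copy_seconds(size_bytes: int, bytes_per_sec: float) -> float:
    """Lower bound on transfer time at a sustained rate."""
    return size_bytes / bytes_per_sec


ON_DISK_SIZE = 13_263_880_213  # "on disk size" of TABLE test_1 (assumed bytes)
MB = 1_000_000                 # assumption: decimal megabytes

# Case 2: limited to 25 MB/s.
limited = estimated_copy_seconds(ON_DISK_SIZE, 25 * MB)

# For comparison: the full 2 Gb/s link, i.e. 250 MB/s.
line_rate = estimated_copy_seconds(ON_DISK_SIZE, 2_000_000_000 / 8)

print(f"limited to 25 MB/s: {limited:.0f} s")   # 531 s
print(f"full 2 Gb/s link:   {line_rate:.0f} s") # 53 s
```

Under these assumptions the limited copy needs roughly 8.8 minutes, about ten times longer than a copy that saturates the link, which is the cost paid for leaving bandwidth to other services.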

      Attachments

        1. image-2023-02-09-10-38-50-512.png
          147 kB
          Xixu Wang
        2. image-2023-02-09-10-47-58-370.png
          152 kB
          Xixu Wang
        3. image-2023-02-13-17-08-37-256.png
          99 kB
          Xixu Wang
        4. image-2023-02-13-17-16-50-491.png
          171 kB
          Xixu Wang
        5. image-2023-02-13-17-22-25-368.png
          180 kB
          Xixu Wang
        6. image-2023-02-13-17-25-15-997.png
          231 kB
          Xixu Wang
        7. image-2023-02-13-17-32-11-650.png
          132 kB
          Xixu Wang

          People

            Assignee: Unassigned
            Reporter: Xixu Wang (wangxixu)
            Votes: 0
            Watchers: 3
