Hadoop HDFS
  HDFS-1457

Limit transmission rate when transferring image between primary and secondary NNs

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.22.0
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Add a configuration variable dfs.image.transfer.bandwidthPerSec to allow the user to specify the maximum bandwidth for transferring the image and edits. Its default value is 0, indicating no throttling.
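      For reference, a minimal sketch of how a daemon might read this property, assuming a hypothetical ImageTransferConfig helper (the constant names below are illustrative, not necessarily those used in the committed patch):

        // Hypothetical helper for reading the new property; 0 means "do not throttle".
        import org.apache.hadoop.conf.Configuration;

        public class ImageTransferConfig {
          public static final String DFS_IMAGE_TRANSFER_RATE_KEY =
              "dfs.image.transfer.bandwidthPerSec";
          public static final long DFS_IMAGE_TRANSFER_RATE_DEFAULT = 0L; // no throttling

          /** Returns the configured image/edits transfer bandwidth in bytes per second. */
          public static long getTransferBandwidth(Configuration conf) {
            return conf.getLong(DFS_IMAGE_TRANSFER_RATE_KEY, DFS_IMAGE_TRANSFER_RATE_DEFAULT);
          }
        }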

      Description

      If the fsimage is very big, the network can be saturated for a short time while the SecondaryNameNode does a checkpoint, causing the JobTracker's requests to the NameNode for file data to fail during job initialization. We therefore limit the transmission rate and compress the transfer to resolve the problem.

      1. checkpoint-limitandcompress.patch
        13 kB
        Yilei Lu
      2. trunkThrottleImage.patch
        19 kB
        Hairong Kuang
      3. trunkThrottleImage1.patch
        20 kB
        Hairong Kuang
      4. trunkThrottleImage2.patch
        19 kB
        Hairong Kuang


          Activity

          Hairong Kuang added a comment -

          This jira's description is copied from Lu's comment on HDFS-1435.

          Yilei Lu added a comment -

          If the fsimage is very big, the network can be saturated for a short time while the SecondaryNameNode does a checkpoint, causing the JobTracker's requests to the NameNode for file data to fail during job initialization. We therefore limit the transmission rate and compress the transfer to resolve the problem.

          The LZO compression codec is not included in the standard Hadoop package, so the default compression is GzipCodec.

          Hairong Kuang added a comment -

          In DataXceiver, I already introduced a throttler for the purpose of the balancer. Could we use that Throttler to limit the transmission rate instead?

          Hairong Kuang added a comment -

          1. Rename BlockTransferThrottler to org.apache.hadoop.hdfs.util.DataTransferThrottler;
          2. Add a configuration parameter that specifies the image transfer rate;
          3. Throttle image/edits transfer at the sender side.
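          For illustration, a minimal sketch of the two pieces described above: a bytes-per-second throttler and a sender-side copy loop that consults it after each chunk. The real class is org.apache.hadoop.hdfs.util.DataTransferThrottler; the class names and the simple average-rate logic below are assumptions, not the committed code.

            import java.io.IOException;
            import java.io.InputStream;
            import java.io.OutputStream;

            // Illustrative throttler: sleeps just long enough to keep the average
            // transfer rate at or below bytesPerSec (which must be > 0).
            class SimpleThrottler {
              private final long bytesPerSec;
              private final long start = System.currentTimeMillis();
              private long bytesSoFar = 0;

              SimpleThrottler(long bytesPerSec) { this.bytesPerSec = bytesPerSec; }

              synchronized void throttle(long numBytes) {
                bytesSoFar += numBytes;
                long expectedMillis = bytesSoFar * 1000 / bytesPerSec;
                long elapsedMillis = System.currentTimeMillis() - start;
                if (expectedMillis > elapsedMillis) {
                  try {
                    Thread.sleep(expectedMillis - elapsedMillis);
                  } catch (InterruptedException ignored) {
                    Thread.currentThread().interrupt();
                  }
                }
              }
            }

            // Sender-side copy loop: write a chunk, then let the throttler pace us.
            class ImageSender {
              static void copy(InputStream in, OutputStream out, SimpleThrottler t)
                  throws IOException {
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = in.read(buf)) > 0) {
                  out.write(buf, 0, n);
                  if (t != null) {
                    t.throttle(n); // a null throttler means transfer at full speed
                  }
                }
              }
            }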

          Yilei Lu added a comment -

          To Hairong: it is a good idea. If we transfer uncompressed data, we should do it the same way as your idea. But if the fsimage is not compressed, limiting the transmission will be better.
          At Baidu we use both rate limiting and compression for the transfer, and it runs well.

          Hairong Kuang added a comment -

          Right, the support for a compressed image was already checked in in HDFS-1435. So I expect that a user with a big image runs the NN with image compression enabled, which in addition will reduce the cost of saving the image to local or remote disks at the NN.

          Then this jira simply supports image transfer throttling. Does this make sense? I am glad to hear that Baidu is running Hadoop well. You guys have done a good job with Hadoop.

          Dmytro Molkov added a comment -

          I like this approach overall. The patch looks good (not much going on in there except for the move of the class).
          The only comment is on the default value for the bandwidth. I know that without limiting we can get speeds up to 10 times greater than the default you are putting in the conf. Do you think it might make sense for the default behaviour to be unlimited, so that only people who want to turn the feature on specify a limit?

          Hairong Kuang added a comment -

          Right, having a large default makes sense. It will make this change backward compatible. What does the community think?

          Hairong Kuang added a comment -

          How about I use a default value of 0 to indicate that there is no need to throttle?

          Hairong Kuang added a comment -

          This patch sets the default transfer rate to 0, indicating that throttling is disabled.
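          A sketch of that zero-means-disabled convention, reusing the illustrative SimpleThrottler from the earlier sketch (the helper below is hypothetical, not the committed code):

            import org.apache.hadoop.conf.Configuration;

            final class ThrottlerFactory {
              /** Creates a throttler only when a positive bandwidth is configured;
               *  null means "transfer at full speed" (the default of 0). */
              static SimpleThrottler createThrottler(Configuration conf) {
                long bandwidth = conf.getLong("dfs.image.transfer.bandwidthPerSec", 0L);
                return bandwidth > 0 ? new SimpleThrottler(bandwidth) : null;
              }

              private ThrottlerFactory() {}
            }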

          Hairong Kuang added a comment -

          AntPatch result:
          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] -1 release audit. The applied patch generated 98 release audit warnings (more than the trunk's current 1 warnings).
          [exec]
          [exec] +1 system test framework. The patch passed system test framework compile.
          [exec]
          Strangely, although the output complains that this patch generates more release audit warnings than the trunk, the trunk also reports 98 release audit warnings, all because of missing license headers. My patch renamed a class, and I checked the new file; it does have the Apache license header.

          Ant tests passed except for the known failures.

          Dmytro Molkov added a comment -

          Cool. With the default rate at 0 this patch looks good to me. Thanks

          Konstantin Shvachko added a comment -

          Looks like imageTransferThrottler is instantiated but not used in FSImage. Should it be moved into GetImageServlet instead?

          Hairong Kuang added a comment -

          I thought about it, but GetImageServlet does not have a non-default constructor, and FSImage holds no reference to it.

          Hairong Kuang added a comment -

          Another option is to initialize the throttler in the NameNode and then pass it to the HTTP servlet through the servlet context.
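          For illustration of this option (not the approach that was ultimately committed; see the later comments), a sketch of passing the throttler to the servlet through the servlet context. The attribute key and class name are hypothetical, and SimpleThrottler refers to the earlier sketch.

            import java.io.IOException;
            import javax.servlet.ServletContext;
            import javax.servlet.http.HttpServlet;
            import javax.servlet.http.HttpServletRequest;
            import javax.servlet.http.HttpServletResponse;

            public class GetImageServletSketch extends HttpServlet {
              // The NameNode side would publish the throttler into the servlet
              // context under this (hypothetical) attribute key.
              static final String THROTTLER_ATTR = "image.transfer.throttler";

              @Override
              protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                  throws IOException {
                ServletContext ctx = getServletContext();
                SimpleThrottler throttler = (SimpleThrottler) ctx.getAttribute(THROTTLER_ATTR);
                // ... stream the fsimage/edits to resp.getOutputStream(), calling
                //     throttler.throttle(bytesWritten) after each chunk when non-null.
              }
            }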

          Ramkumar Vadali added a comment -

          @Hairong, I see a lot of release audit warnings in a clean MR checkout too. I think this is due to HADOOP-7008. Please see MAPREDUCE-2172 for this.

          Hairong Kuang added a comment -

          After talking with Konstantin, we decided to create a throttler on the fly for each file transfer. This patch does this.

          Konstantin Shvachko added a comment -

          +1 This looks good.

          Hairong Kuang added a comment -

          I've just committed this. Thanks Lu!

          Hairong Kuang added a comment -

          I performed a manual test to make sure that the configured bandwidth value gets passed to GetImageServlet.


            People

             • Assignee:
               Hairong Kuang
             • Reporter:
               Hairong Kuang
             • Votes:
               0
             • Watchers:
               9
