Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels: None

      Description

      This item introduces a discussion of how to reduce the time necessary to start a large cluster from tens of minutes to a handful of minutes.

        Issue Links

          Activity

          Robert Chansler added a comment -

          All code changes done as part of other items.

          Konstantin Shvachko added a comment -

          After the two optimizations HADOOP-3364 and HADOOP-3369 the load time is improved by a factor of 2.
          The biggest progress is in image saving and block processing, each of which is almost 4 times faster:

          • image saving: 4 times faster (70 sec → 18 sec)
          • block processing: 4 times faster (320 sec → 87 sec)

          The table below summarizes sizes and compares new and old time measurements.

                                 new               old
          objects                10 mln
          files & dirs           4 mln
          blocks                 6 mln
          heap size              3.275 GB
          image size             0.6 GB
          edits size per day     0.27 GB
          data-nodes             500
          blocks per node        36,000
          image load time        111 sec           132 sec
          edits load time        75 sec            84 sec
          image save time        18 sec            70 sec
          block processing       87 sec            320 sec
          total startup time     291 sec = 5 min   606 sec = 10 min

          This leads to the optimized startup time of 5 minutes, out of which

          load fsimage 38%
          load edits 26%
          save new fsimage 6%
          process block reports 30%
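The breakdown follows directly from the per-step times in the table above; a quick sketch of the arithmetic (the step names and times are taken from the table, nothing else is assumed):

```python
# Per-step times in seconds for the optimized startup, from the table above.
steps = {
    "load fsimage": 111,
    "load edits": 75,
    "save new fsimage": 18,
    "process block reports": 87,
}

total = sum(steps.values())  # 291 sec, i.e. about 5 minutes
shares = {name: round(100 * t / total) for name, t in steps.items()}

print(total)   # 291
print(shares)  # {'load fsimage': 38, 'load edits': 26, 'save new fsimage': 6, 'process block reports': 30}
```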

          I think more improvements can be made here, especially in the loading part.
          For the edits log we should optimize the ADD and CLOSE transactions, as noted in HADOOP-3364.
          For image loading it is probably block processing, but that needs to be evaluated.
          Leaving this issue open for now.

          Konstantin Shvachko added a comment -

          Note. I updated the numbers in the tables above. Due to a calculation error the number for load edits was too high.

          > How do you measure only processing time?
          Yes, I modified the code and added logging lines that print the processing times. The changes are in the patches linked to this issue.
          For measuring block report processing I use a modified NNThroughputBenchmark.
          Therefore it is an estimate, not the real cluster processing time, but it gives an upper bound on how fast the name-node can process those reports.

          Cagdas Gerede added a comment -

          How do you measure only the processing time? Did you have to change the source code to add logging lines for these measurements, or did you use the code as is? The datanode records the time in its log after it prepares and finishes sending the block report, and the namenode reports after it leaves safemode. I am trying to get similar benchmarks. Could you please clarify how you were able to measure only the processing time?

          Konstantin Shvachko added a comment -

          Yes, replication = 3.
          Block report processing in my test includes only name-node processing. No preparation time, no RPC overhead.
          The processing time is proportional to the number of data-nodes (confirmed by the tests).
          This is because the name-node locks namespace for each block report and processes them sequentially one after another.
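A minimal sketch of why the total time grows linearly with the node count. This is an illustration of the sequential, lock-protected processing described above, not the actual name-node code; the per-report cost is a hypothetical constant derived from the 320 sec / 500 nodes figure in this issue:

```python
import threading

namespace_lock = threading.Lock()
PER_REPORT_MS = 640  # hypothetical: 320 sec / 500 nodes, in milliseconds

def process_block_reports(num_datanodes):
    """Each block report is processed under the global namespace lock,
    one after another, so total time grows linearly with node count."""
    total_ms = 0
    for _ in range(num_datanodes):
        with namespace_lock:  # serializes all report processing
            total_ms += PER_REPORT_MS
    return total_ms / 1000  # seconds

print(process_block_reports(500))  # 320.0
print(process_block_reports(250))  # 160.0
```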

          Cagdas Gerede added a comment -

          What is the replica count in this benchmark?
          I am guessing it is 3 (6 million blocks / 500 nodes = 12,000 distinct blocks per node; 36,000 / 12,000 = 3 replicas).
          Could you clarify?
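The guess checks out arithmetically, using only the numbers from the tables in this issue:

```python
blocks = 6_000_000            # distinct blocks in the namespace (from the table)
datanodes = 500               # from the table
replicas_per_node = 36_000    # "blocks per node" from the table

distinct_per_node = blocks / datanodes              # 12,000 distinct blocks per node
replication = replicas_per_node / distinct_per_node

print(distinct_per_node)  # 12000.0
print(replication)        # 3.0
```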

          What does "process block reports" include? Does it include the time for generating block reports on the datanode and the time for the namenode to receive them? Or is it only the time to process all block reports, not including receiving time?

          I was wondering how the numbers would be affected if you had the same number of objects but 1000 datanodes, or 250 datanodes, instead of 500.
          Do you have any guess?

          Konstantin Shvachko added a comment - edited

          Estimates.

          The name-node startup consists of 4 major steps:

          1. load fsimage
          2. load edits
          3. save new merged fsimage
          4. process block report

          I am estimating the name-node startup time based on an image with 10 million objects.
          The estimates are based on my experiments with a real cluster.
          The numbers are scaled proportionally (by a factor of 4.3) to 10 mln objects.

          Under the assumption that each file on average has 1.5 blocks, 10 mln objects
          translates into 4 mln files and directories, and 6 mln blocks.
          Other cluster parameters are summarized in the following table.

          objects 10 mln
          files & dirs 4 mln
          blocks 6 mln
          heap size 3.275 GB
          image size 0.6 GB
          image load time 132 sec
          image save time 70 sec
          edits size per day 0.27 GB
          edits load time 84 sec
          data-nodes 500
          blocks per node 36,000
          block report processing 320 sec
          total startup time 10 min

          This leads to the total startup time of 10 minutes, out of which

          load fsimage 22%
          load edits 14%
          save new fsimage 11%
          process block reports 53%
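The object-count split above can be sanity-checked under the stated assumption of 1.5 blocks per file (treating the 4 mln "files & dirs" as files for the purpose of the estimate, as the comment does):

```python
objects = 10_000_000   # files + directories + blocks
blocks_per_file = 1.5  # stated assumption

# objects = f + 1.5 * f = 2.5 * f, so solve for f:
files_and_dirs = objects / (1 + blocks_per_file)
blocks = blocks_per_file * files_and_dirs

print(files_and_dirs)  # 4000000.0
print(blocks)          # 6000000.0
```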

          Optimization.

          1. VM memory heap parameters play a key role in the startup process, as discussed in HADOOP-3248.
            It is highly recommended to set the initial heap size -Xms close to the maximum heap size, because
            all that memory will be used by the name-node anyway.
          2. Optimizing loading and saving is largely a matter of reducing object allocation.
            In the case of loading, object allocations are unavoidable, so we need to make sure
            there are no intermediate allocations of temporary objects.
            In the case of saving, all object allocations should be eliminated, which is done in HADOOP-3248.
          3. During loading, for each file or directory the name-node performs a lookup in the
            namespace tree, starting from the root, in order to find the object's parent directory and then insert the object into it.
            We can take advantage of the fact that in the image a directory's children are not interleaved with the children of other directories.
            So we can find the parent once and then include all of its children without repeating the search.
          4. Block reporting is the single most expensive part of the startup process.
            According to my experiments, most of the processing time goes into first adding all blocks to the needed-replication queue
            and then removing them from it. During startup all blocks are guaranteed to be under-replicated in the beginning,
            but most of them, and in the regular case all of them, will be removed from that list.
            So in a sense we dynamically create a huge temporary structure just to verify that all blocks have enough replicas.
            In addition, in pre-HADOOP-2606 versions (before release 0.17) the replication monitor would start processing
            those under-replicated blocks and try to assign nodes for copying them.
            The structure works fine during regular operation, because then it contains only those blocks that are in fact under-replicated.
            The processing of block reports goes 5 times faster if the addition to the needed-replication queue is removed.
          5. We should also reduce the number of block lookups in the blocksMap. I counted 5 lookups just in addStoredBlock(), while
            only one lookup is necessary, because the namespace is locked and there is only one thread that modifies it.
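Point 3 above can be sketched as follows. This is an illustrative Python sketch, not the actual Java name-node code; the flat input format and the function name are hypothetical. The idea: because one directory's children are contiguous in the image, the root-to-parent lookup can be cached across consecutive entries instead of being repeated per child.

```python
import os

def load_image_entries(entries):
    """Build a namespace map from a flat image listing.
    entries: iterable of full paths, with each directory's children contiguous.
    Returns a dict mapping directory path -> list of child names."""
    tree = {"/": []}
    cached_parent_path = None  # parent resolved for the previous entry
    for path in entries:
        parent = os.path.dirname(path) or "/"
        if parent != cached_parent_path:
            # In the real name-node this would be a root-to-parent tree walk;
            # with the cache it happens once per directory, not once per child.
            tree.setdefault(parent, [])
            cached_parent_path = parent
        tree[parent].append(os.path.basename(path))
    return tree

tree = load_image_entries(["/a/f1", "/a/f2", "/a/f3", "/b/g1"])
print(tree)  # {'/': [], '/a': ['f1', 'f2', 'f3'], '/b': ['g1']}
```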

            People

            • Assignee: Konstantin Shvachko
            • Reporter: Robert Chansler
            • Votes: 0
            • Watchers: 1