Hadoop Common: HADOOP-3860

Compare name-node performance when journaling is performed into local hard-drives or nfs.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.19.0
    • Fix Version/s: 0.19.0
    • Component/s: benchmarks
    • Labels: None

      Description

      The goal of this issue is to measure how name-node performance depends on where the edits log is written.
      Three types of journal storage should be evaluated:

      1. local hard drive;
      2. remote drive mounted via nfs;
      3. nfs filer.
      Attachments

      1. NNThruputMoreOps.patch (21 kB, Konstantin Shvachko)

          Activity

          Hudson added a comment -

          Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/)
          Konstantin Shvachko added a comment -

          Dhruba,
          For creates we definitely have disk IO contention, not the CPU.
          About HADOOP-2330: Hairong tested it with her new synthetic load generator, with very encouraging results.

          dhruba borthakur added a comment -

          Hi Konstantin,

          Great analysis. I completely agree with you that coarse-grained locking in the name-node should not be impacting scalability of opens and creates. It is the disk sync times that really matter. BTW, when you ran the test on a single local disk, did you see the disk max out on IO? You said that 5,710 creates/sec occurred; was the limitation the CPU on the machine or disk IO contention?

          Also, I had a patch, HADOOP-2330, that pre-allocates the transaction log in large chunks. If I had seen this JIRA earlier, I would have asked you to repeat the exact same test on the same hardware with that patch applied.
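
          For illustration, a minimal sketch of the pre-allocation idea, assuming a simple RandomAccessFile-backed log; the class, method, and chunk size below are hypothetical, not the actual HADOOP-2330 code:

              // Sketch: extend the log in large zero-filled chunks up front so that
              // later appends and syncs do not pay file-extension costs on every
              // transaction. Names and the 1 MB chunk size are illustrative.
              import java.io.IOException;
              import java.io.RandomAccessFile;

              public class PreallocationSketch {
                  static final int CHUNK = 1024 * 1024; // hypothetical 1 MB chunk

                  // Grow the file chunk-wise until it is at least minLength bytes.
                  static void preallocate(RandomAccessFile editLog, long minLength)
                          throws IOException {
                      byte[] zeros = new byte[CHUNK];
                      long position = editLog.length();
                      while (position < minLength) {
                          editLog.seek(position);
                          editLog.write(zeros);
                          position += CHUNK;
                      }
                  }

                  public static void main(String[] args) throws IOException {
                      try (RandomAccessFile log = new RandomAccessFile("edits.log", "rw")) {
                          preallocate(log, 4L * CHUNK); // reserve 4 MB ahead of the writer
                      }
                  }
              }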

          Konstantin Shvachko added a comment -

          I just committed this.
          Please feel free to comment on and discuss the benchmark results.

          Raghu Angadi added a comment -

          It looks as though it is not just the number of mutations that matters; something else matters as well (maybe the amount of data written to the edits log per mutation, CPU, or locking). That could explain the large disparity between the numbers for creates, renames, and deletes, even though each of these is a single mutation.

          Konstantin Shvachko added a comment -

          I fixed a typo in the formula explaining throughput:
          - 1,000,000/(tE-tE)
          + 1,000,000/(tE-tB)

          Konstantin Shvachko added a comment -

          You probably know that better than I do. But the point of the benchmarking was to compare NFS vs local drives.
          There was a suspicion that IOs to NFS are substantially slower than to local drives, and it turned out they are pretty much the same.
          It would be even better, of course, if we could fine-tune NFS.

          Allen Wittenauer added a comment -

          NFS is a black art: when doing benchmarks such as these, the implementation matters. Are we using NFSv2? v3? v4? UDP or TCP? What are rsize/wsize set to? What is on the server side and what is on the client side? What about TCP/IP tuning?

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12387153/NNThruputMoreOps.patch
          against trunk revision 680823.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2979/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2979/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2979/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2979/console

          This message is automatically generated.

          Konstantin Shvachko added a comment (edited) -

          I benchmarked three operations: create, rename, and delete using NNThroughputBenchmark, which is a pure name-node benchmark. It calls the name-node methods directly without going through the RPC protocol, so the RPC overhead is not included in these results and should be measured separately, say with the synthetic load generator.
          In a sense these benchmarks determine an upper bound for HDFS operations, namely the maximum throughput the name-node can sustain under heavy load.

          Each run starts with an empty file system and performs 1 million operations handled by 256 threads on the name-node. The output is the throughput, i.e. the number of operations per second, calculated as 1,000,000/(tE-tB), where tB is when the first thread starts and tE is when all threads stop. The threads run in parallel.
          Creates create empty files and do not close them. Renames change file names but do not move the files.
          All test results are consistent except for one distortion in deletes on a remote drive, which is way out of the expected range. I don't know what caused it; one day the numbers were good, the next they were not.
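
          As a minimal sketch of this measurement scheme, assuming a no-op placeholder where the real benchmark invokes the name-node methods directly (the names below are illustrative, not the actual NNThroughputBenchmark code):

              // Sketch: 256 threads released together; tB is taken when they start,
              // tE when the last one finishes; throughput = totalOps / (tE - tB).
              import java.util.concurrent.CountDownLatch;

              public class ThroughputSketch {
                  static final int THREADS = 256;
                  static final int TOTAL_OPS = 1_000_000;

                  public static void main(String[] args) throws InterruptedException {
                      CountDownLatch startGate = new CountDownLatch(1);
                      CountDownLatch doneGate = new CountDownLatch(THREADS);
                      for (int t = 0; t < THREADS; t++) {
                          new Thread(() -> {
                              try {
                                  startGate.await(); // all threads begin together
                                  for (int i = 0; i < TOTAL_OPS / THREADS; i++) {
                                      // placeholder: one name-node operation
                                      // (create, rename, or delete)
                                  }
                              } catch (InterruptedException ignored) {
                              } finally {
                                  doneGate.countDown();
                              }
                          }).start();
                      }
                      long tB = System.currentTimeMillis(); // first thread starts
                      startGate.countDown();
                      doneGate.await();
                      long tE = System.currentTimeMillis(); // all threads have stopped
                      double elapsedSec = Math.max(1, tE - tB) / 1000.0;
                      System.out.printf("throughput: %.0f ops/sec%n", TOTAL_OPS / elapsedSec);
                  }
              }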

          Each test consists of 1,000,000 operations performed using 256 threads.
          Results are in ops/sec.

          Log to             | open    | create (no close) | rename | delete
          none               | 126,119 |                   |        |
          1 Local HD         |         | 5,710             | 8,400  | 20,690
          1 NFS HD           |         | 5,600             | 8,290  | 12,090
          1 NFS Filer        |         | 5,676             | 8,134  | 21,100
          4 Local HD         |         | 5,210             |        |
          3 loc HD, 1 NFS HD |         | 5,150             |        |

          Some conclusions:

          • Local drive is faster than NFS, and
          • an NFS filer is faster than a remote drive;
          • but the difference between NFS storage and local drives is very slim, only 2-3%.
          • Using 4 local drives instead of 1 degrades performance by only 9%, even though we write onto the drives sequentially (one after another).
            It would be fair to say that there is some parallelism in writing, since the current code batches writes first and then syncs them at once in large chunks. So while the writes are sequential, the syncs are parallel (see the sketch after this list).
          • Opens (getBlockLocation()) are 22 times faster than creates,
          • which means journaling is the real bottleneck for name-node operations,
          • and the lack of fine-grained locking in the namespace data structures is not a problem so far. Otherwise, the throughputs for opens and the other operations would be characterized by the same, or at least close, numbers.
          • Further optimization of name-node performance should, in my opinion, be focused on efficient journaling.
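
          As a minimal illustration of the batching mentioned in the fourth bullet above, here is a hedged sketch assuming one output stream per journal directory; the real FSEditLog code is more involved, and the stream handling here is illustrative only:

              // Sketch of "sequential writes, parallel syncs": records are appended
              // to each journal stream in turn, then the batched records are forced
              // to stable storage on all streams at once.
              import java.io.FileOutputStream;
              import java.io.IOException;
              import java.util.List;

              public class JournalSyncSketch {
                  // Append one edit record to every journal stream, one after another.
                  static void logEdit(List<FileOutputStream> journals, byte[] record)
                          throws IOException {
                      for (FileOutputStream out : journals) {
                          out.write(record); // sequential, cheap buffered writes
                      }
                  }

                  // Push the batched records to disk on all journals at once.
                  static void syncAll(List<FileOutputStream> journals) {
                      journals.parallelStream().forEach(out -> {
                          try {
                              out.flush();
                              out.getFD().sync(); // the expensive fsync runs in parallel
                          } catch (IOException e) {
                              throw new RuntimeException(e);
                          }
                      });
                  }
              }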

          Another set of statistics characterizes the actual load on the name-node on some of our clusters. Unfortunately, the statistics for open are broken, and we do not collect stats for renames, so I can only present creates and deletes. Please contribute if somebody has more data.

          Actual load (ops/sec) | open | create | delete
          peak                  |      | 144    | 6,460
          average               |      | 11     | 50
          • These numbers show that the actual peak load for creates is about 40 times lower than what the name-node can handle, and about 3 times lower for deletes. On average the picture is even more drastic:
            the name-node's processing capability is 400-500 times higher than the actual average load on it.
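
          For reference, these ratios follow directly from the tables above: 5,710/144 ≈ 40 for peak creates; 20,690/6,460 ≈ 3.2 for peak deletes; and on average 5,710/11 ≈ 520 for creates and 20,690/50 ≈ 414 for deletes, which is where the 400-500x figure comes from.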
          Konstantin Shvachko added a comment -

          I am attaching the patch that was used for the benchmarks.
          It extends NNThroughputBenchmark with the new operations rename and delete, and introduces additional command-line options
          that control what the benchmark does with the generated files before and after execution.


            People

            • Assignee: Konstantin Shvachko
            • Reporter: Konstantin Shvachko
            • Votes: 0
            • Watchers: 8