Hadoop HDFS / HDFS-1846

Don't fill preallocated portion of edits log with 0x00

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: namenode
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      HADOOP-2330 added a feature to preallocate space in the local file system for the NN transaction log. That change seeks past the current end of the file and writes out some data, which on most systems results in the intervening data in the file being filled with zeros. Most underlying file systems have special handling for sparse files, and don't actually allocate blocks on disk for blocks of a file which consist completely of 0x00.

      I've seen cases in the wild where the volume an edits dir is on fills up, resulting in a partial final transaction being written out to disk. If you examine the bytes of this (now corrupt) edits file, you'll see the partial final transaction followed by a lot of zeros, suggesting that the preallocation previously succeeded before the volume ran out of space. If we fill the preallocated space with something other than zeros, we'd likely see the failure at preallocation time, rather than transaction-writing time, and so cause the NN to crash earlier, without a partial transaction being written out.

      I also hypothesize that filling the preallocated space in the edits log with something other than 0x00 will result in a performance improvement in NN throughput. I haven't tested this yet, but I intend to as part of this JIRA.
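
      A rough sketch of the proposed scheme (illustrative only, not the attached patch; the filler byte, the 1 MB size, and the file name are assumptions): write a full buffer of a non-zero filler such as OP_INVALID starting at the current end of the file, so the file system has to allocate real blocks and a full volume fails at preallocation time instead of mid-transaction.

      import java.io.IOException;
      import java.io.RandomAccessFile;
      import java.nio.ByteBuffer;
      import java.nio.channels.FileChannel;

      public class PreallocationSketch {
        private static final int PREALLOC_SIZE = 1024 * 1024; // 1 MB, assumed
        private static final byte OP_INVALID = (byte) 0xFF;   // assumed filler value

        // Extend the log by PREALLOC_SIZE bytes of non-zero filler starting at the
        // current end of file, so no sparse hole is left behind.
        static void preallocate(FileChannel fc) throws IOException {
          ByteBuffer fill = ByteBuffer.allocateDirect(PREALLOC_SIZE);
          while (fill.hasRemaining()) {
            fill.put(OP_INVALID);
          }
          fill.flip();
          long position = fc.size();
          while (fill.hasRemaining()) {
            position += fc.write(fill, position); // positional write, no seek past EOF
          }
          fc.force(false); // an out-of-space error surfaces here, before any transaction is half-written
        }

        public static void main(String[] args) throws IOException {
          try (RandomAccessFile raf = new RandomAccessFile("edits.sketch", "rw")) {
            preallocate(raf.getChannel());
          }
        }
      }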

      Attachments

  1. hdfs-1846.3.patch
        5 kB
        Aaron T. Myers
      2. editsStored
        1.00 MB
        Aaron T. Myers
      3. hdfs-1846.3.patch
        5 kB
        Aaron T. Myers
      4. hdfs-1846.2.patch
        4 kB
        Aaron T. Myers
      5. hdfs-1846.1.patch
        4 kB
        Aaron T. Myers
      6. hdfs-1846-perf-analysis.0.patch
        7 kB
        Aaron T. Myers
      7. hdfs-1846.0.txt
        2 kB
        Aaron T. Myers

        Issue Links

          Activity

          Transition | Time In Source Status | Execution Times | Last Executer | Last Execution Date
          Open -> Patch Available | 7d 23h 18m | 1 | Aaron T. Myers | 27/Apr/11 23:19
          Patch Available -> Resolved | 7h 41m | 1 | Todd Lipcon | 28/Apr/11 07:01
          Resolved -> Closed | 200d 18h 51m | 1 | Arun C Murthy | 15/Nov/11 00:52
          Colin Patrick McCabe made changes -
          Link This issue relates to HDFS-3510 [ HDFS-3510 ]
          Todd Lipcon added a comment -

          Good point - the bug in the previous code was that we were preallocating by writing 1MB to position +1MB instead of just writing 1MB to position.
          fallocate would probably be a bit of an improvement, but this edit log isn't generally a big bottleneck. That said, on the DN side, it should make a difference. There's a patch floating around somewhere to add that - we just need to benchmark it to verify.
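
          To make that concrete, here is a standalone contrast of the two write positions (a sketch under stated assumptions: the 1 MB fill size and file names are placeholders, not the actual HDFS code):

          import java.io.IOException;
          import java.io.RandomAccessFile;
          import java.nio.ByteBuffer;
          import java.nio.channels.FileChannel;

          public class PreallocPositionContrast {
            static final int MB = 1024 * 1024;

            public static void main(String[] args) throws IOException {
              ByteBuffer fill = ByteBuffer.allocate(MB); // zero-filled, like the old scheme

              // Buggy placement: the fill lands 1 MB past the current end of file, so the
              // region the next transactions are actually written into is left as a hole.
              try (RandomAccessFile raf = new RandomAccessFile("sparse-prealloc", "rw")) {
                FileChannel fc = raf.getChannel();
                fc.write(fill.duplicate(), fc.size() + MB);
              }

              // Fixed placement: the fill starts at the current end of file, so the region
              // the next transactions land in is physically backed by allocated blocks.
              try (RandomAccessFile raf = new RandomAccessFile("dense-prealloc", "rw")) {
                FileChannel fc = raf.getChannel();
                fc.write(fill.duplicate(), fc.size());
              }
              // Starting from empty files, compare `ls -l` with `du`: sparse-prealloc is 2 MB
              // long but only its far tail is allocated; dense-prealloc is 1 MB, fully allocated.
            }
          }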

          Colin Patrick McCabe added a comment -

          Let me clear up one misconception here:

          Sparse files occur when you seek to a point past the end of a file, and write data.

          Sparse files DO NOT occur when you write out data using write(2). Writing zeros to a file will not result in a sparse file. The kernel does not have time to check every buffer to see whether it consists of all zeros. If you say that you have important data to write, it believes you.

          Using the fallocate system call is significantly faster than copying all of the data from userspace. This is true for two reasons:
          1. The copying solution needs to copy all those zeros (or 0xdeadbeefs, or whatever) from userspace to kernel space. The fallocate solution copies nothing.
          2. Filesystems that support extents, like ext4, can optimize their space layout if you tell them ahead of time that you want a big contiguous chunk of data.

          For performance reasons, it would be really good to add fallocate support. I do not believe that it would require another configuration knob. The native code would just need to have some kind of compile-time check that falls back on a non-fallocate solution if the environment is too old.
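
          A small standalone Java illustration of the first two points (file names are placeholders; run it on a file system that supports holes and compare the two files with `du`). The fallocate path itself requires native code and is not shown here.

          import java.io.IOException;
          import java.io.RandomAccessFile;

          public class HoleVsWrite {
            static final int MB = 1024 * 1024;

            public static void main(String[] args) throws IOException {
              // Seek past the end and write one byte: the skipped megabyte becomes a hole.
              try (RandomAccessFile raf = new RandomAccessFile("holey", "rw")) {
                raf.seek(MB);
                raf.write(0); // file length is 1 MB + 1, but almost no blocks are allocated
              }

              // Explicitly write 1 MB of zeros: the kernel allocates real blocks, because
              // write(2) never inspects the buffer to see whether it is all zeros.
              try (RandomAccessFile raf = new RandomAccessFile("dense", "rw")) {
                raf.write(new byte[MB]); // file length is 1 MB and ~1 MB of blocks are allocated
              }
              // `du -B1 holey dense` should report a few KB for the first file and
              // roughly 1 MB for the second, even though both read back as zeros.
            }
          }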

          Arun C Murthy made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Nathan Roberts added a comment -

          Thanks for the patch! I learned some things on this one that surprised me a little bit. If I read the test results above correctly, we're doing about 100 mkdirs per second, each of which results in ~250 bytes being added to the editlog. What I found very surprising was that this change made such a dramatic impact when the rate of adding to the log is actually not that fast. For example, I tried doing basically the same thing in native code (write(250), fdatasync(), preallocate every 1MB) and was seeing upwards of 5700 iops on a basic SATA drive. The difference between true preallocate and sparse preallocate was only 4 seconds over 100,000 transactions. So it wasn't immediately intuitive to me how the mkdirs test was seeing a 160 second difference for the same amount of IO. Then I realized it's probably just the cache on the drive and the way it is able to very quickly overwrite dirty blocks in the cache. This probably happens very frequently when running at 5700 iops, and far less frequently when running at 100 iops. In other words, in both cases (5700 native iops, 100 mkdirs), the number of physical seeks per second was probably about the same, and this was the limiting factor. Since the mkdirs test took two orders of magnitude longer to run, the absolute delta in time should also be about two orders of magnitude larger, which it was. Life is good again.

          So, the I/O subsystem is capable of significantly more IOPS than we're asking it to do. Is 100 mkdirs/sec considered good enough, or is this something that ought to be looked into?

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #650 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/650/)
          HDFS-1846. Preallocate edit log with OP_INVALID instead of zero bytes to ensure blocks are actually allocated. Contributed by Aaron T. Myers.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #612 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/612/)
          HDFS-1846. Preallocate edit log with OP_INVALID instead of zero bytes to ensure blocks are actually allocated. Contributed by Aaron T. Myers.

          Todd Lipcon made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Fix Version/s 0.23.0 [ 12315571 ]
          Resolution Fixed [ 1 ]
          Todd Lipcon added a comment -

          Committed to trunk. Thanks Aaron and all those who reviewed.

          Todd Lipcon added a comment -

          Just ran a sanity check where I disabled preallocation altogether. Average op time went from 9ms to 29ms on my disk (ext3, no special mount options, 7200RPM). I also verified that TestOfflineEditsViewer passes with the editsStored file that Aaron has attached here. I will commit this momentarily.

          dhruba borthakur added a comment -

          +1

          Todd Lipcon added a comment -

          I ran tests on a non-SSD as well, and also added code to calculate the 99th percentile latency. Here are my results:

          Results for classic scheme:
          Overall total ops: 10000
          Overall total time of all ops: 103168.0
          Overall average time of op: 10.3168
          Overall fastest op: 5
          Overall slowest op: 400
          Preallocation total ops: 3
          Preallocation total time of all ops: 96.0
          Preallocation average time of op: 32.0
          Preallocation fastest op: 31
          Preallocation slowest op: 33
          Total time of slowest 1% of ops: 8462.0
          Average time of slowest 1% of ops: 84.62
          99th percentile latency: 39
          
          
          Results for new scheme:
          Overall total ops: 10000
          Overall total time of all ops: 91469.0
          Overall average time of op: 9.1469
          Overall fastest op: 2
          Overall slowest op: 341
          Preallocation total ops: 3
          Preallocation total time of all ops: 251.0
          Preallocation average time of op: 83.66666666666667
          Preallocation fastest op: 49
          Preallocation slowest op: 149
          Total time of slowest 1% of ops: 6787.0
          Average time of slowest 1% of ops: 67.87
          99th percentile latency: 25
          

          So, my results agree with Aaron's that this is an improvement for total throughput as well as 99th percentile, even though the preallocations take longer. If we find at some point that the preallocation is a performance issue, it wouldn't be too hard to push preallocation to a separate thread. But let's not add that complexity now, since this patch seems to be an overall improvement for durability as well as average performance.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12477602/hdfs-1846.3.patch
          against trunk revision 1097252.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 7 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/424//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/424//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/424//console

          This message is automatically generated.

          Aaron T. Myers made changes -
          Attachment hdfs-1846.3.patch [ 12477602 ]
          Aaron T. Myers added a comment -

          Re-uploading the patch, since the QA bot tried to apply the binary file as a patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12477601/editsStored
          against trunk revision 1097252.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/423//console

          This message is automatically generated.

          Aaron T. Myers made changes -
          Attachment hdfs-1846.3.patch [ 12477600 ]
          Attachment editsStored [ 12477601 ]
          Aaron T. Myers added a comment -

          Updated patch which fixes the TestOfflineEditsViewer failure. Also included is an updated version of the binary file src/test/hdfs/org/apache/hadoop/hdfs/tools/offlineEditsViewer/editsStored, which is used for testing.

          The previous version of editsStored contained trailing 0x00s, and the TestOfflineEditsViewer test checked for the presence of these bytes. That wasn't really a reasonable test under the previous preallocation scheme since, as Dhruba points out, the JVM doesn't define the value of the data put in a file after seeking past the end of it.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12477595/hdfs-1846.2.patch
          against trunk revision 1097252.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.hdfs.TestLargeBlock
          org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/422//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/422//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/422//console

          This message is automatically generated.

          Aaron T. Myers made changes -
          Attachment hdfs-1846.2.patch [ 12477595 ]
          Aaron T. Myers added a comment -

          For some reason I can't explain, on the Hudson box `du' determined that the preallocated file was using 258 blocks, not 257 as it does on my box. This patch switches the test from checking == 257 to >= 257.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12477582/hdfs-1846.1.patch
          against trunk revision 1097252.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.server.namenode.TestEditLogFileOutputStream
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/421//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/421//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/421//console

          This message is automatically generated.

          Aaron T. Myers made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Aaron T. Myers made changes -
          Attachment hdfs-1846.1.patch [ 12477582 ]
          Aaron T. Myers added a comment -

          Patch which implements the new preallocation scheme I've described, and adds a test to ensure that the blocks are actually allocated on the underlying FS.
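
          (The attached patch is what actually ships; purely as a sketch of how such a check can be made, a test could shell out to `du` and assert a minimum number of allocated blocks. The class name, block size, and threshold below are assumptions, not the real test.)

          import java.io.BufferedReader;
          import java.io.File;
          import java.io.IOException;
          import java.io.InputStreamReader;

          public class AllocationCheckSketch {
            // Number of 512-byte blocks `du` reports as actually allocated on disk.
            static long allocatedBlocks(File f) throws IOException, InterruptedException {
              Process p = new ProcessBuilder("du", "--block-size=512", f.getAbsolutePath()).start();
              try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String line = r.readLine(); // e.g. "2048\t/path/to/edits"
                p.waitFor();
                return Long.parseLong(line.split("\\s+")[0]);
              }
            }

            public static void main(String[] args) throws Exception {
              long blocks = allocatedBlocks(new File(args[0]));
              // With ~1 MB of real preallocation we expect at least 1 MB / 512 B = 2048 blocks;
              // a sparse edits file would report far fewer.
              if (blocks < 2048) {
                throw new AssertionError("edits file looks sparse: only " + blocks + " blocks allocated");
              }
            }
          }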

          Aaron T. Myers added a comment -

          @Eli - the first test was indeed done on an SSD. Here are the results of running the test on a spinning HDD:

          ----------------------------------------------------
          Results for classic scheme:
          Overall total ops: 100000
          Overall total time of all ops: 1024072.0
          Overall average time of op: 10.24072
          Overall fastest op: 3
          Overall slowest op: 178
          Preallocation total ops: 23
          Preallocation total time of all ops: 871.0
          Preallocation average time of op: 37.869565217391305
          Preallocation fastest op: 28
          Preallocation slowest op: 52
          Total time of slowest 1% of ops: 48949.0
          Average time of slowest 1% of ops: 48.949
          ----------------------------------------------------
          ----------------------------------------------------
          Results for new scheme:
          Overall total ops: 100000
          Overall total time of all ops: 860702.0
          Overall average time of op: 8.60702
          Overall fastest op: 2
          Overall slowest op: 288
          Preallocation total ops: 23
          Preallocation total time of all ops: 1236.0
          Preallocation average time of op: 53.73913043478261
          Preallocation fastest op: 41
          Preallocation slowest op: 91
          Total time of slowest 1% of ops: 36456.0
          Average time of slowest 1% of ops: 36.456
          ----------------------------------------------------
          

          The results are similar to my previous test, just a whole lot slower across the board. If anything the percent improvement for the average op seems to have improved - from 5% improvement on an SSD to 18% improvement on a normal HDD. The average performance degradation of a preallocation-inducing op has also improved - from 1200% worse to 42% worse.

          Also worth noting that, per an offline suggestion from Todd, I ran this test slightly differently. I ran each test (classic and new schemes) twice, to account for any warm-up time for the various caches involved (disk, JIT, classloading, local FS, etc.) The results I've included here are from the second run of each test. Here's a diff based off my previous patch:

          index 7e74429..d599224 100644
          --- src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestEditLogOutputStream.java
          +++ src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestEditLogOutputStream.java
          @@ -19,11 +19,13 @@ public class TestEditLogOutputStream {
             @Test
             public void testEditLogOutputStreamPerformanceWithClassicPreallocationScheme() throws IOException {
               performTestAndPrintResults(false);
          +    performTestAndPrintResults(false);
             }
             
             @Test
             public void testEditLogOutputStreamPerformanceWithNewPreallocationScheme() throws IOException {
               performTestAndPrintResults(true);
          +    performTestAndPrintResults(true);
             }
           
             private void performTestAndPrintResults(boolean useNewPreallocationScheme) throws IOException {
          @@ -32,6 +34,7 @@ public class TestEditLogOutputStream {
               
               Configuration conf = new Configuration();
               conf.set(DFSConfigKeys.DFS_PERMISSIONS_ENABLED_KEY, "false");
          +    conf.set("hadoop.tmp.dir", "/data/1/atm/edits-log-preallocate-test/tmp");
               FileSystem.setDefaultUri(conf, "hdfs://localhost:0");
               conf.set("dfs.http.address", "127.0.0.1:0");
               File baseDir = new File(conf.get("hadoop.tmp.dir"), "dfs/");
          
          Eli Collins added a comment -

          @atm - are these results from a system using an ssd? Would be interesting to see the same numbers on a spinning disk.

          Nathan Roberts added a comment -

          If this change improves performance, then by all means I'm for it. However, I want to make sure we're not getting confused over why pre-allocation helps. The key point about pre-allocation is that the inode is not dirtied on every append (since the size of the file isn't changing), which would otherwise double the number of seeks during the sync operation. This was a huge win and is attained whether or not the blocks are truly pre-allocated or simply sparse. The indirect block pointers (in ext3 anyway) will also be dirtied frequently as we actually allocate blocks to the file; this happens each time we cross a 4K boundary. Doing this all at once probably contributes to some of the improvement that Aaron measured.

          Brock Noland added a comment -

          This is a major improvement to the existing preallocation scheme. If we decide the time to preallocate is too costly, then I think the existing scheme should be scrapped. Sparse files are not preallocated. You can create a file which appears to be 500GB instantly by seeking past the end of the file:

            rm -f test; time dd if=/dev/null of=test seek=999999999; ls -lh test
            0+0 records in
            0+0 records out
            0 bytes (0 B) copied, 1.4e-05 seconds, 0.0 kB/s

            real 0m0.002s
            user 0m0.000s
            sys 0m0.002s
            -rw-r--r-- 1 root root 477G Apr 24 21:04 test

          Aaron T. Myers added a comment -

          Also, it occurs to me that the largest edit log operation we could potentially have (a rename of two 8,000 character paths) would be substantially larger on disk than the current threshold we use for determining when to do preallocation - 4kb. Thus, I propose that if we do decide to go the route I'm suggesting, we also up this threshold.

          Aaron T. Myers made changes -
          Attachment hdfs-1846-perf-analysis.0.patch [ 12477237 ]
          Aaron T. Myers added a comment -

          Patch of the performance analysis I did to produce the results in the previous comment.

          Aaron T. Myers added a comment -

          Hey guys, I've done some performance analysis, and here are the results. I'll post a patch shortly (not intended for inclusion) so you can see what I did to do the analysis. If anyone would like to try this patch on their own system, I'd be very curious to see the results, since as Nathan points out, the results can be affected by many factors.

          ----------------------------------------------------
          Results for classic scheme:
          Overall total ops: 100000 
          Overall total time of all ops: 39224.0
          Overall average time of op: 0.39224
          Overall fastest op: 0
          Overall slowest op: 223 
          Preallocation total ops: 23
          Preallocation total time of all ops: 24.0
          Preallocation average time of op: 1.0434782608695652
          Preallocation fastest op: 0
          Preallocation slowest op: 6
          Total time of slowest 1% of ops: 4858.0
          Average time of slowest 1% of ops: 4.858
          ----------------------------------------------------
          ----------------------------------------------------
          Results for new scheme: 
          Overall total ops: 100000
          Overall total time of all ops: 37192.0
          Overall average time of op: 0.37192
          Overall fastest op: 0 
          Overall slowest op: 231
          Preallocation total ops: 23 
          Preallocation total time of all ops: 291.0 
          Preallocation average time of op: 12.652173913043478
          Preallocation fastest op: 10 
          Preallocation slowest op: 21
          Total time of slowest 1% of ops: 4670.0
          Average time of slowest 1% of ops: 4.67
          ----------------------------------------------------
          

          I personally ran this test several times on my own system, and the results from this particular test run are pretty representative. There wasn't much variation across runs.

          As you can see from this data, with the new scheme, performing an edit which causes an on-disk preallocation is indeed slower - about 10x slower than a similar op using the previous scheme. However, I was correct that the time taken for the average op is indeed lower with the new scheme than the old. Also worth noting that the average time taken for the slowest 1% of ops is faster with the new scheme, since there were only 23 preallocations during the test run.

          I'm of the opinion that the increased latency of the preallocation-inducing ops is worth the performance improvement of the average op and the extra durability this patch would provide. The worst increase in latency from an op which happens to induce a preallocation is ~20ms, which seems acceptable.

          Also, curiously, in the course of this analysis I discovered that under both preallocation schemes there are fairly consistently ~10 ops whose total time taken was ~200ms on my system. These ops seem uncorrelated with preallocations. Determining what's causing those is being left as future work.

          Nathan Roberts added a comment -

          I agree with Todd. We need to verify the latency is not impacted. The tricky part of this testing, though, is that the result can depend on lots of factors (disk subsystem on the namenode, other I/O going on at the same time for whatever reason, etc.). I don't know if it makes sense or not, but it would seem to me that if we had a way of validating that each transaction was sane before processing it, that would be sufficient. Just a simple end-of-transaction magic cookie would probably be enough. I'm not sure actually pre-allocating the blocks gets us that much. In fact, it can hurt because we end up physically writing the blocks twice.

          Todd Lipcon added a comment -

          Only thing I'm a little nervous about with this fix is that it might affect the 99th percentile of sync times, since now after every 1MB of writes, we have to actually write and sync an entire 1MB chunk. I'd be interested to see a comparison of the 99th percentile latencies for flushAndSync() before and after this change, just to make sure it's acceptable.

          Todd Lipcon added a comment -

          > open a separate JIRA to implement this using native code if fallocate is available

          I think this is easier said than done, because there's no particular way to check whether it's actually "hooked up". So it would have to be Yet Another Config, and it doesn't particularly seem worth it beyond what you're doing here.

          Aaron T. Myers added a comment -

          > That will be great, but I suspect that the jvm does not give you this feature. So, we might as well go with what Java gives us. At the least, the "pre-allocation" should allocate blocks for the file, so your patch looks good to me.

          I agree - the JVM doesn't provide this feature.

          The other option would be to use JNI to call fallocate(2). I don't believe this is available on RHEL5 or ext3, though. Perhaps we should go with what I currently have, and open a separate JIRA to implement this using native code if fallocate is available?

          I'll do a little work to fix up this patch and add a test.

          dhruba borthakur added a comment -

          > Is your point that we should figure out some other way to make the underlying FS allocate blocks

          That will be great, but I suspect that the jvm does not give you this feature. So, we might as well go with what Java gives us. At the least, the "pre-allocation" should allocate blocks for the file, so your patch looks good to me.

          Aaron T. Myers made changes -
          Attachment hdfs-1846.0.txt [ 12477023 ]
          Aaron T. Myers added a comment -

          Thanks for the comments, Dhruba. Is your point that we should figure out some other way to make the underlying FS allocate blocks, besides deliberately initializing the blocks? If so, I should probably rename this JIRA to be "make sure the preallocation of edits logs actually allocates blocks on disk."

          I'm quite sure (at least on ext3 on Ubuntu Maverick) that the edits file does indeed end up being sparse with the current implementation:

          [10:29:51] atm@simon:~/src/apache/hadoop.git$ ls -l name-current.sparse/edits 
          -rw-r--r-- 1 atm atm 1049092 2011-04-21 10:25 name-current.sparse/edits
          [10:30:53] atm@simon:~/src/apache/hadoop.git$ du -B1 !$
          du -B1 name-current.sparse/edits
          32768	name-current.sparse/edits
          [10:30:57] atm@simon:~/src/apache/hadoop.git$ ls -l dfs/name/current/edits 
          -rw-r--r-- 1 atm atm 1048580 2011-04-21 10:26 dfs/name/current/edits
          [10:31:03] atm@simon:~/src/apache/hadoop.git$ du -B1 !$
          du -B1 dfs/name/current/edits
          1052672	dfs/name/current/edits
          

          Where "name-current.sparse" contains a copy of a dfs.name.dir/current directory created with the current code, and "dfs/name/current" is a name directory created with modified code which fills the preallocation space with OP_INVALID.

          I've attached a patch of the code changes I made to do this test. Not intended for commit.

          dhruba borthakur added a comment -

          The idea of preallocation is to somehow make the underlying filesystem allocate disk blocks for the file without initializing them. The JVM specification says that the data in this area is "undefined", specifically for the case when it might use an API to the underlying system to allocate blocks but not initialize their contents.

          Aaron T. Myers created issue -

            People

             • Assignee: Aaron T. Myers
             • Reporter: Aaron T. Myers
             • Votes: 0
             • Watchers: 7
