Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-3023

Optimize entries in edits log for persistBlocks calls

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: HA branch (HDFS-1623), 0.23.2
    • Fix Version/s: HA branch (HDFS-1623)
    • Component/s: namenode, performance
    • Labels:
      None

      Description

      One of the performance issues noticed in the HA branch is due to the much larger edit logs, now that we are writing OP_ADD transactions to the edit log on every block allocation. We can condense these calls down in two ways:
      1) use variable-length integers for the block list length, size, and genstamp (most of these end up fitting in far less than 8 bytes)
      2) use delta-coding for the genstamp and block size for any blocks after the first block (most blocks will be the same size and only slightly higher genstamps)
      3) introduce a new OP_UPDATE_BLOCKS transaction that doesn't re-serialize metadata information like lease owner, permissions, etc
      4) allow OP_UPDATE_BLOCKS to only re-serialize the blocks that have changed for a given transaction

      1. hdfs-3023.txt
        50 kB
        Todd Lipcon
      2. hdfs-3023-HDFS-1623.txt
        49 kB
        Todd Lipcon

        Issue Links

          Activity

          Hide
          Todd Lipcon added a comment -

          Apparently I am bad at counting. The above should read "in four ways"

          I ran a teragen with 580 mappers and a 4MB block size on the HA branch with optimizations 1 and 2 in place and it reduced the size of the edit log from 1.2GB to ~600MB (factor of two savings). This benchmark is designed to log a lot of persistBlocks() calls, so the savings in less strenuous use cases won't be as large, but still worth doing.

          Show
          Todd Lipcon added a comment - Apparently I am bad at counting. The above should read "in four ways" I ran a teragen with 580 mappers and a 4MB block size on the HA branch with optimizations 1 and 2 in place and it reduced the size of the edit log from 1.2GB to ~600MB (factor of two savings). This benchmark is designed to log a lot of persistBlocks() calls, so the savings in less strenuous use cases won't be as large, but still worth doing.
          Hide
          Todd Lipcon added a comment -

          Attached patch implements optimizations 1-3 above. I didn't do optimization #4 since it's a bit more complicated and will only really help with files that are several blocks long. I'd like to leave it for future work.

          The patch is against the HA branch, since the branch is soon to be merged and the code around OP_ADD, etc, differs a bit. Rather than do it twice, I figured I'd just work on HA branch.

          I'll do a round of benchmarks on this patch tomorrow.

          Show
          Todd Lipcon added a comment - Attached patch implements optimizations 1-3 above. I didn't do optimization #4 since it's a bit more complicated and will only really help with files that are several blocks long. I'd like to leave it for future work. The patch is against the HA branch, since the branch is soon to be merged and the code around OP_ADD, etc, differs a bit. Rather than do it twice, I figured I'd just work on HA branch. I'll do a round of benchmarks on this patch tomorrow.
          Hide
          Eli Collins added a comment -

          Is log size a proxy for performance? Ie do we have data to suggest the slowdown is due to the log size (# of bytes we're writing) vs just an increase in the # of transactions, which may increase eg the # flushes, or have you isolated for those already (eg there are no net new flushes) and it's down to just the # bytes going across the wire? I guess another way to phrase this is how much of the overhead in the HDFS-3010 for real workloads is option #2 vs option #3.

          Show
          Eli Collins added a comment - Is log size a proxy for performance? Ie do we have data to suggest the slowdown is due to the log size (# of bytes we're writing) vs just an increase in the # of transactions, which may increase eg the # flushes, or have you isolated for those already (eg there are no net new flushes) and it's down to just the # bytes going across the wire? I guess another way to phrase this is how much of the overhead in the HDFS-3010 for real workloads is option #2 vs option #3.
          Hide
          Todd Lipcon added a comment -

          I do think the log size is the major difference causing the performance issues in HDFS-3010. In my initial benchmarks last night (which included only one of the above optimizations) the regression was only 7% vs the ~20% I had seen in earlier testing. But, these results aren't reliable since I also switched to a different cluster between benchmarks. I'll rerun the benchmarks today.

          Also worth noting - you referred to "option #2 vs option #3" above, but really option #3 is combined with option #2 for extra savings (ie the OP_UPDATE_BLOCKS is around half to 1/3 the size because of the delta-vint-encoded genstamps and sizes)

          Show
          Todd Lipcon added a comment - I do think the log size is the major difference causing the performance issues in HDFS-3010 . In my initial benchmarks last night (which included only one of the above optimizations) the regression was only 7% vs the ~20% I had seen in earlier testing. But, these results aren't reliable since I also switched to a different cluster between benchmarks. I'll rerun the benchmarks today. Also worth noting - you referred to "option #2 vs option #3" above, but really option #3 is combined with option #2 for extra savings (ie the OP_UPDATE_BLOCKS is around half to 1/3 the size because of the delta-vint-encoded genstamps and sizes)
          Hide
          Eli Collins added a comment -

          The patch looks good btw, though why does updating block info also now update mod time and access time?

          Show
          Eli Collins added a comment - The patch looks good btw, though why does updating block info also now update mod time and access time?
          Hide
          Eli Collins added a comment -

          Thanks for the info. By option #2 vs option #3 above I meant component #2 vs component #3 in HDFS-3010, ie is most of the overhead is due to component #3 (which this patch does not address).

          Show
          Eli Collins added a comment - Thanks for the info. By option #2 vs option #3 above I meant component #2 vs component #3 in HDFS-3010 , ie is most of the overhead is due to component #3 (which this patch does not address).
          Hide
          Eli Collins added a comment -

          Thanks for the info. By option #2 vs option #3 above I meant component #2 vs component #3 in HDFS-3010, ie is most of the overhead is due to component #3 (which this patch does not address).

          Show
          Eli Collins added a comment - Thanks for the info. By option #2 vs option #3 above I meant component #2 vs component #3 in HDFS-3010 , ie is most of the overhead is due to component #3 (which this patch does not address).
          Hide
          Todd Lipcon added a comment -

          though why does updating block info also now update mod time and access time

          I just moved the following code:

          -    // Update the salient file attributes.
          -    file.setAccessTime(addCloseOp.atime);
          -    file.setModificationTimeForce(addCloseOp.mtime);
          

          since OP_ADD contains those pieces of info, but OP_UPDATE_BLOCKS doesn't.

          Show
          Todd Lipcon added a comment - though why does updating block info also now update mod time and access time I just moved the following code: - // Update the salient file attributes. - file.setAccessTime(addCloseOp.atime); - file.setModificationTimeForce(addCloseOp.mtime); since OP_ADD contains those pieces of info, but OP_UPDATE_BLOCKS doesn't.
          Hide
          Todd Lipcon added a comment -

          By option #2 vs option #3 above I meant component #2 vs component #3 in HDFS-3010, ie is most of the overhead is due to component #3 (which this patch does not address).

          Ah, I understand now.

          Unfortunately the fsync cost is a "fixed penalty" of sorts. But I think the fsync will cost less when the data is smaller, especially when logging to NFS. Granted there is a fixed cost for seek, but there isn't much we can do to optimize that, short of flash or BBU write cache – so everywhere else we can shave off some overhead we should take the opportunity.

          The good thing about the sync cost is that it might harm the performance of individual writers, but since syncs are batched, total system throughput shouldn't go down any more than the individual writer throughput. Will continue to investigate with further benchmarking.

          Show
          Todd Lipcon added a comment - By option #2 vs option #3 above I meant component #2 vs component #3 in HDFS-3010 , ie is most of the overhead is due to component #3 (which this patch does not address). Ah, I understand now. Unfortunately the fsync cost is a "fixed penalty" of sorts. But I think the fsync will cost less when the data is smaller, especially when logging to NFS. Granted there is a fixed cost for seek, but there isn't much we can do to optimize that, short of flash or BBU write cache – so everywhere else we can shave off some overhead we should take the opportunity. The good thing about the sync cost is that it might harm the performance of individual writers, but since syncs are batched, total system throughput shouldn't go down any more than the individual writer throughput. Will continue to investigate with further benchmarking.
          Hide
          Eli Collins added a comment -

          Thanks for the info wrt perf, makes sense. Will be interesting to see which component dominates.

          Change wrt mod/access time makes sense now. Seems weird to set these on a close op, but that's independent of this change.

          Wrt the TODO, could we possibly reach this case via seek then append (iirc that's disallowed)? I was thinking truncate might be another option but that will throw away the blocks so shouldn't get here. I agree that we should only need this for the last block, maybe keep the code as is but add an assert?

          Show
          Eli Collins added a comment - Thanks for the info wrt perf, makes sense. Will be interesting to see which component dominates. Change wrt mod/access time makes sense now. Seems weird to set these on a close op, but that's independent of this change. Wrt the TODO, could we possibly reach this case via seek then append (iirc that's disallowed)? I was thinking truncate might be another option but that will throw away the blocks so shouldn't get here. I agree that we should only need this for the last block, maybe keep the code as is but add an assert?
          Hide
          Todd Lipcon added a comment -

          I'll file a follow-up JIRA about that TODO I added. I think the scenario could be seen if you write several blocks with an earlier version of HDFS, then fsync() – you'll see an OP_ADD that adds a bunch of blocks at once, only the last of which should be marked UnderConstruction.

          Show
          Todd Lipcon added a comment - I'll file a follow-up JIRA about that TODO I added. I think the scenario could be seen if you write several blocks with an earlier version of HDFS, then fsync() – you'll see an OP_ADD that adds a bunch of blocks at once, only the last of which should be marked UnderConstruction.
          Hide
          Todd Lipcon added a comment -

          I've been running this patch on a cluster. With this patch combined with HDFS-3020 and HDFS-3025, the performance of the 4MB-block teragen is very close to trunk's (where we don't log persistBlocks at all). I've also tested plenty of restarts to verify the loading code.

          Show
          Todd Lipcon added a comment - I've been running this patch on a cluster. With this patch combined with HDFS-3020 and HDFS-3025 , the performance of the 4MB-block teragen is very close to trunk's (where we don't log persistBlocks at all). I've also tested plenty of restarts to verify the loading code.
          Hide
          Eli Collins added a comment -

          Hey Todd, reviewed the patch again. +1, thanks for all the good benchmarking.

          Show
          Eli Collins added a comment - Hey Todd, reviewed the patch again. +1, thanks for all the good benchmarking.
          Hide
          Todd Lipcon added a comment -

          Previous rev conflicted with the patch on trunk which added toString() to all the ops. New rev adds toString(). Since it's a trivial change, I'll commit this based on the prior +1.

          Show
          Todd Lipcon added a comment - Previous rev conflicted with the patch on trunk which added toString() to all the ops. New rev adds toString(). Since it's a trivial change, I'll commit this based on the prior +1.
          Hide
          Todd Lipcon added a comment -

          Committed to HA branch.

          Show
          Todd Lipcon added a comment - Committed to HA branch.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-HAbranch-build #93 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/93/)
          HDFS-3023. Optimize entries in edits log for persistBlocks call. Contributed by Todd Lipcon. (Revision 1295356)

          Result = UNSTABLE
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295356
          Files :

          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-1623.txt
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LayoutVersion.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOpCodes.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageSerialization.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/EditsElement.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/EditsLoaderCurrent.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/resources/editsStored
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/resources/editsStored.xml
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-HAbranch-build #93 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/93/ ) HDFS-3023 . Optimize entries in edits log for persistBlocks call. Contributed by Todd Lipcon. (Revision 1295356) Result = UNSTABLE todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295356 Files : /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/CHANGES. HDFS-1623 .txt /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LayoutVersion.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOpCodes.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageSerialization.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/EditsElement.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/EditsLoaderCurrent.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/test/resources/editsStored /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/test/resources/editsStored.xml
          Hide
          Todd Lipcon added a comment -

          Accidentally missed one place to add the new LV:

          diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java
          index 8960cbc..fdc9892 100644
          --- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java
          +++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java
          @@ -122,7 +122,8 @@ class ImageLoaderCurrent implements ImageLoader {
             protected final DateFormat dateFormat = 
                                                 new SimpleDateFormat("yyyy-MM-dd HH:mm");
             private static int[] versions = { -16, -17, -18, -19, -20, -21, -22, -23,
          -      -24, -25, -26, -27, -28, -30, -31, -32, -33, -34, -35, -36, -37, -38, -39};
          +      -24, -25, -26, -27, -28, -30, -31, -32, -33, -34, -35, -36, -37, -38, -39,
          +      -40};
             private int imageVersion = 0;
           
             /* (non-Javadoc)
          

          Cool if I just commit this delta under the same JIRA?

          Show
          Todd Lipcon added a comment - Accidentally missed one place to add the new LV: diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java index 8960cbc..fdc9892 100644 --- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java +++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java @@ -122,7 +122,8 @@ class ImageLoaderCurrent implements ImageLoader { protected final DateFormat dateFormat = new SimpleDateFormat( "yyyy-MM-dd HH:mm" ); private static int [] versions = { -16, -17, -18, -19, -20, -21, -22, -23, - -24, -25, -26, -27, -28, -30, -31, -32, -33, -34, -35, -36, -37, -38, -39}; + -24, -25, -26, -27, -28, -30, -31, -32, -33, -34, -35, -36, -37, -38, -39, + -40}; private int imageVersion = 0; /* (non-Javadoc) Cool if I just commit this delta under the same JIRA?
          Hide
          Aaron T. Myers added a comment -

          +1 for the delta.

          Show
          Aaron T. Myers added a comment - +1 for the delta.
          Hide
          Todd Lipcon added a comment -

          Committed the delta (r1295721)

          Show
          Todd Lipcon added a comment - Committed the delta (r1295721)
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-HAbranch-build #94 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/94/)
          Amend HDFS-3023. Add new layout version number to ImageLoaderCurrent to fix TestOfflineImageViewer. (Revision 1295721)

          Result = UNSTABLE
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295721
          Files :

          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-HAbranch-build #94 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/94/ ) Amend HDFS-3023 . Add new layout version number to ImageLoaderCurrent to fix TestOfflineImageViewer. (Revision 1295721) Result = UNSTABLE todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295721 Files : /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java

            People

            • Assignee:
              Todd Lipcon
              Reporter:
              Todd Lipcon
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development