Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-3023

Optimize entries in edits log for persistBlocks calls

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • HA branch (HDFS-1623), 0.23.2
    • HA branch (HDFS-1623)
    • namenode, performance
    • None

    Description

      One of the performance issues noticed in the HA branch is due to the much larger edit logs, now that we are writing OP_ADD transactions to the edit log on every block allocation. We can condense these calls down in two ways:
      1) use variable-length integers for the block list length, size, and genstamp (most of these end up fitting in far less than 8 bytes)
      2) use delta-coding for the genstamp and block size for any blocks after the first block (most blocks will be the same size and only slightly higher genstamps)
      3) introduce a new OP_UPDATE_BLOCKS transaction that doesn't re-serialize metadata information like lease owner, permissions, etc
      4) allow OP_UPDATE_BLOCKS to only re-serialize the blocks that have changed for a given transaction

      Attachments

        1. hdfs-3023-HDFS-1623.txt
          49 kB
          Todd Lipcon
        2. hdfs-3023.txt
          50 kB
          Todd Lipcon

        Issue Links

          Activity

            tlipcon Todd Lipcon added a comment -

            Apparently I am bad at counting. The above should read "in four ways"

            I ran a teragen with 580 mappers and a 4MB block size on the HA branch with optimizations 1 and 2 in place and it reduced the size of the edit log from 1.2GB to ~600MB (factor of two savings). This benchmark is designed to log a lot of persistBlocks() calls, so the savings in less strenuous use cases won't be as large, but still worth doing.

            tlipcon Todd Lipcon added a comment - Apparently I am bad at counting. The above should read "in four ways" I ran a teragen with 580 mappers and a 4MB block size on the HA branch with optimizations 1 and 2 in place and it reduced the size of the edit log from 1.2GB to ~600MB (factor of two savings). This benchmark is designed to log a lot of persistBlocks() calls, so the savings in less strenuous use cases won't be as large, but still worth doing.
            tlipcon Todd Lipcon added a comment -

            Attached patch implements optimizations 1-3 above. I didn't do optimization #4 since it's a bit more complicated and will only really help with files that are several blocks long. I'd like to leave it for future work.

            The patch is against the HA branch, since the branch is soon to be merged and the code around OP_ADD, etc, differs a bit. Rather than do it twice, I figured I'd just work on HA branch.

            I'll do a round of benchmarks on this patch tomorrow.

            tlipcon Todd Lipcon added a comment - Attached patch implements optimizations 1-3 above. I didn't do optimization #4 since it's a bit more complicated and will only really help with files that are several blocks long. I'd like to leave it for future work. The patch is against the HA branch, since the branch is soon to be merged and the code around OP_ADD, etc, differs a bit. Rather than do it twice, I figured I'd just work on HA branch. I'll do a round of benchmarks on this patch tomorrow.
            eli2 Eli Collins added a comment -

            Is log size a proxy for performance? Ie do we have data to suggest the slowdown is due to the log size (# of bytes we're writing) vs just an increase in the # of transactions, which may increase eg the # flushes, or have you isolated for those already (eg there are no net new flushes) and it's down to just the # bytes going across the wire? I guess another way to phrase this is how much of the overhead in the HDFS-3010 for real workloads is option #2 vs option #3.

            eli2 Eli Collins added a comment - Is log size a proxy for performance? Ie do we have data to suggest the slowdown is due to the log size (# of bytes we're writing) vs just an increase in the # of transactions, which may increase eg the # flushes, or have you isolated for those already (eg there are no net new flushes) and it's down to just the # bytes going across the wire? I guess another way to phrase this is how much of the overhead in the HDFS-3010 for real workloads is option #2 vs option #3.
            tlipcon Todd Lipcon added a comment -

            I do think the log size is the major difference causing the performance issues in HDFS-3010. In my initial benchmarks last night (which included only one of the above optimizations) the regression was only 7% vs the ~20% I had seen in earlier testing. But, these results aren't reliable since I also switched to a different cluster between benchmarks. I'll rerun the benchmarks today.

            Also worth noting - you referred to "option #2 vs option #3" above, but really option #3 is combined with option #2 for extra savings (ie the OP_UPDATE_BLOCKS is around half to 1/3 the size because of the delta-vint-encoded genstamps and sizes)

            tlipcon Todd Lipcon added a comment - I do think the log size is the major difference causing the performance issues in HDFS-3010 . In my initial benchmarks last night (which included only one of the above optimizations) the regression was only 7% vs the ~20% I had seen in earlier testing. But, these results aren't reliable since I also switched to a different cluster between benchmarks. I'll rerun the benchmarks today. Also worth noting - you referred to "option #2 vs option #3" above, but really option #3 is combined with option #2 for extra savings (ie the OP_UPDATE_BLOCKS is around half to 1/3 the size because of the delta-vint-encoded genstamps and sizes)
            eli2 Eli Collins added a comment -

            The patch looks good btw, though why does updating block info also now update mod time and access time?

            eli2 Eli Collins added a comment - The patch looks good btw, though why does updating block info also now update mod time and access time?
            eli2 Eli Collins added a comment -

            Thanks for the info. By option #2 vs option #3 above I meant component #2 vs component #3 in HDFS-3010, ie is most of the overhead is due to component #3 (which this patch does not address).

            eli2 Eli Collins added a comment - Thanks for the info. By option #2 vs option #3 above I meant component #2 vs component #3 in HDFS-3010 , ie is most of the overhead is due to component #3 (which this patch does not address).
            eli2 Eli Collins added a comment -

            Thanks for the info. By option #2 vs option #3 above I meant component #2 vs component #3 in HDFS-3010, ie is most of the overhead is due to component #3 (which this patch does not address).

            eli2 Eli Collins added a comment - Thanks for the info. By option #2 vs option #3 above I meant component #2 vs component #3 in HDFS-3010 , ie is most of the overhead is due to component #3 (which this patch does not address).
            tlipcon Todd Lipcon added a comment -

            though why does updating block info also now update mod time and access time

            I just moved the following code:

            -    // Update the salient file attributes.
            -    file.setAccessTime(addCloseOp.atime);
            -    file.setModificationTimeForce(addCloseOp.mtime);
            

            since OP_ADD contains those pieces of info, but OP_UPDATE_BLOCKS doesn't.

            tlipcon Todd Lipcon added a comment - though why does updating block info also now update mod time and access time I just moved the following code: - // Update the salient file attributes. - file.setAccessTime(addCloseOp.atime); - file.setModificationTimeForce(addCloseOp.mtime); since OP_ADD contains those pieces of info, but OP_UPDATE_BLOCKS doesn't.
            tlipcon Todd Lipcon added a comment -

            By option #2 vs option #3 above I meant component #2 vs component #3 in HDFS-3010, ie is most of the overhead is due to component #3 (which this patch does not address).

            Ah, I understand now.

            Unfortunately the fsync cost is a "fixed penalty" of sorts. But I think the fsync will cost less when the data is smaller, especially when logging to NFS. Granted there is a fixed cost for seek, but there isn't much we can do to optimize that, short of flash or BBU write cache – so everywhere else we can shave off some overhead we should take the opportunity.

            The good thing about the sync cost is that it might harm the performance of individual writers, but since syncs are batched, total system throughput shouldn't go down any more than the individual writer throughput. Will continue to investigate with further benchmarking.

            tlipcon Todd Lipcon added a comment - By option #2 vs option #3 above I meant component #2 vs component #3 in HDFS-3010 , ie is most of the overhead is due to component #3 (which this patch does not address). Ah, I understand now. Unfortunately the fsync cost is a "fixed penalty" of sorts. But I think the fsync will cost less when the data is smaller, especially when logging to NFS. Granted there is a fixed cost for seek, but there isn't much we can do to optimize that, short of flash or BBU write cache – so everywhere else we can shave off some overhead we should take the opportunity. The good thing about the sync cost is that it might harm the performance of individual writers, but since syncs are batched, total system throughput shouldn't go down any more than the individual writer throughput. Will continue to investigate with further benchmarking.
            eli2 Eli Collins added a comment -

            Thanks for the info wrt perf, makes sense. Will be interesting to see which component dominates.

            Change wrt mod/access time makes sense now. Seems weird to set these on a close op, but that's independent of this change.

            Wrt the TODO, could we possibly reach this case via seek then append (iirc that's disallowed)? I was thinking truncate might be another option but that will throw away the blocks so shouldn't get here. I agree that we should only need this for the last block, maybe keep the code as is but add an assert?

            eli2 Eli Collins added a comment - Thanks for the info wrt perf, makes sense. Will be interesting to see which component dominates. Change wrt mod/access time makes sense now. Seems weird to set these on a close op, but that's independent of this change. Wrt the TODO, could we possibly reach this case via seek then append (iirc that's disallowed)? I was thinking truncate might be another option but that will throw away the blocks so shouldn't get here. I agree that we should only need this for the last block, maybe keep the code as is but add an assert?
            tlipcon Todd Lipcon added a comment -

            I'll file a follow-up JIRA about that TODO I added. I think the scenario could be seen if you write several blocks with an earlier version of HDFS, then fsync() – you'll see an OP_ADD that adds a bunch of blocks at once, only the last of which should be marked UnderConstruction.

            tlipcon Todd Lipcon added a comment - I'll file a follow-up JIRA about that TODO I added. I think the scenario could be seen if you write several blocks with an earlier version of HDFS, then fsync() – you'll see an OP_ADD that adds a bunch of blocks at once, only the last of which should be marked UnderConstruction.
            tlipcon Todd Lipcon added a comment -

            I've been running this patch on a cluster. With this patch combined with HDFS-3020 and HDFS-3025, the performance of the 4MB-block teragen is very close to trunk's (where we don't log persistBlocks at all). I've also tested plenty of restarts to verify the loading code.

            tlipcon Todd Lipcon added a comment - I've been running this patch on a cluster. With this patch combined with HDFS-3020 and HDFS-3025 , the performance of the 4MB-block teragen is very close to trunk's (where we don't log persistBlocks at all). I've also tested plenty of restarts to verify the loading code.
            eli2 Eli Collins added a comment -

            Hey Todd, reviewed the patch again. +1, thanks for all the good benchmarking.

            eli2 Eli Collins added a comment - Hey Todd, reviewed the patch again. +1, thanks for all the good benchmarking.
            tlipcon Todd Lipcon added a comment -

            Previous rev conflicted with the patch on trunk which added toString() to all the ops. New rev adds toString(). Since it's a trivial change, I'll commit this based on the prior +1.

            tlipcon Todd Lipcon added a comment - Previous rev conflicted with the patch on trunk which added toString() to all the ops. New rev adds toString(). Since it's a trivial change, I'll commit this based on the prior +1.
            tlipcon Todd Lipcon added a comment -

            Committed to HA branch.

            tlipcon Todd Lipcon added a comment - Committed to HA branch.
            hudson Hudson added a comment -

            Integrated in Hadoop-Hdfs-HAbranch-build #93 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/93/)
            HDFS-3023. Optimize entries in edits log for persistBlocks call. Contributed by Todd Lipcon. (Revision 1295356)

            Result = UNSTABLE
            todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295356
            Files :

            • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-1623.txt
            • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LayoutVersion.java
            • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java
            • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
            • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java
            • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java
            • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOpCodes.java
            • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageSerialization.java
            • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/EditsElement.java
            • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/EditsLoaderCurrent.java
            • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/resources/editsStored
            • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/resources/editsStored.xml
            hudson Hudson added a comment - Integrated in Hadoop-Hdfs-HAbranch-build #93 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/93/ ) HDFS-3023 . Optimize entries in edits log for persistBlocks call. Contributed by Todd Lipcon. (Revision 1295356) Result = UNSTABLE todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295356 Files : /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/CHANGES. HDFS-1623 .txt /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LayoutVersion.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOpCodes.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageSerialization.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/EditsElement.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/EditsLoaderCurrent.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/test/resources/editsStored /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/test/resources/editsStored.xml
            tlipcon Todd Lipcon added a comment -

            Accidentally missed one place to add the new LV:

            diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java
            index 8960cbc..fdc9892 100644
            --- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java
            +++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java
            @@ -122,7 +122,8 @@ class ImageLoaderCurrent implements ImageLoader {
               protected final DateFormat dateFormat = 
                                                   new SimpleDateFormat("yyyy-MM-dd HH:mm");
               private static int[] versions = { -16, -17, -18, -19, -20, -21, -22, -23,
            -      -24, -25, -26, -27, -28, -30, -31, -32, -33, -34, -35, -36, -37, -38, -39};
            +      -24, -25, -26, -27, -28, -30, -31, -32, -33, -34, -35, -36, -37, -38, -39,
            +      -40};
               private int imageVersion = 0;
             
               /* (non-Javadoc)
            

            Cool if I just commit this delta under the same JIRA?

            tlipcon Todd Lipcon added a comment - Accidentally missed one place to add the new LV: diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java index 8960cbc..fdc9892 100644 --- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java +++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java @@ -122,7 +122,8 @@ class ImageLoaderCurrent implements ImageLoader { protected final DateFormat dateFormat = new SimpleDateFormat( "yyyy-MM-dd HH:mm" ); private static int [] versions = { -16, -17, -18, -19, -20, -21, -22, -23, - -24, -25, -26, -27, -28, -30, -31, -32, -33, -34, -35, -36, -37, -38, -39}; + -24, -25, -26, -27, -28, -30, -31, -32, -33, -34, -35, -36, -37, -38, -39, + -40}; private int imageVersion = 0; /* (non-Javadoc) Cool if I just commit this delta under the same JIRA?
            atm Aaron Myers added a comment -

            +1 for the delta.

            atm Aaron Myers added a comment - +1 for the delta.
            tlipcon Todd Lipcon added a comment -

            Committed the delta (r1295721)

            tlipcon Todd Lipcon added a comment - Committed the delta (r1295721)
            hudson Hudson added a comment -

            Integrated in Hadoop-Hdfs-HAbranch-build #94 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/94/)
            Amend HDFS-3023. Add new layout version number to ImageLoaderCurrent to fix TestOfflineImageViewer. (Revision 1295721)

            Result = UNSTABLE
            todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295721
            Files :

            • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java
            hudson Hudson added a comment - Integrated in Hadoop-Hdfs-HAbranch-build #94 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/94/ ) Amend HDFS-3023 . Add new layout version number to ImageLoaderCurrent to fix TestOfflineImageViewer. (Revision 1295721) Result = UNSTABLE todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295721 Files : /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java

            People

              tlipcon Todd Lipcon
              tlipcon Todd Lipcon
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: