[HDFS-3023] Optimize entries in edits log for persistBlocks calls - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: HA branch (HDFS-1623), 0.23.2
Fix Version/s: HA branch (HDFS-1623)
Component/s: namenode, performance
Labels:
None

Target Version/s:

HA branch (HDFS-1623)
Hadoop Flags:

Reviewed

Description

One of the performance issues noticed in the HA branch is due to the much larger edit logs, now that we are writing OP_ADD transactions to the edit log on every block allocation. We can condense these calls down in two ways:
1) use variable-length integers for the block list length, size, and genstamp (most of these end up fitting in far less than 8 bytes)
2) use delta-coding for the genstamp and block size for any blocks after the first block (most blocks will be the same size and only slightly higher genstamps)
3) introduce a new OP_UPDATE_BLOCKS transaction that doesn't re-serialize metadata information like lease owner, permissions, etc
4) allow OP_UPDATE_BLOCKS to only re-serialize the blocks that have changed for a given transaction

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

hdfs-3023-HDFS-1623.txt
28/Feb/12 05:18
49 kB
Todd Lipcon
hdfs-3023.txt
01/Mar/12 00:21
50 kB
Todd Lipcon

Issue Links

is related to

HDFS-3010 Get performance on HA branch to match trunk

Resolved

Activity

Ascending order - Click to sort in descending order

Todd Lipcon added a comment - 28/Feb/12 04:59

Apparently I am bad at counting. The above should read "in four ways"

I ran a teragen with 580 mappers and a 4MB block size on the HA branch with optimizations 1 and 2 in place and it reduced the size of the edit log from 1.2GB to ~600MB (factor of two savings). This benchmark is designed to log a lot of persistBlocks() calls, so the savings in less strenuous use cases won't be as large, but still worth doing.

Todd Lipcon added a comment - 28/Feb/12 04:59 Apparently I am bad at counting. The above should read "in four ways" I ran a teragen with 580 mappers and a 4MB block size on the HA branch with optimizations 1 and 2 in place and it reduced the size of the edit log from 1.2GB to ~600MB (factor of two savings). This benchmark is designed to log a lot of persistBlocks() calls, so the savings in less strenuous use cases won't be as large, but still worth doing.

Todd Lipcon added a comment - 28/Feb/12 05:18

Attached patch implements optimizations 1-3 above. I didn't do optimization #4 since it's a bit more complicated and will only really help with files that are several blocks long. I'd like to leave it for future work.

The patch is against the HA branch, since the branch is soon to be merged and the code around OP_ADD, etc, differs a bit. Rather than do it twice, I figured I'd just work on HA branch.

I'll do a round of benchmarks on this patch tomorrow.

Todd Lipcon added a comment - 28/Feb/12 05:18 Attached patch implements optimizations 1-3 above. I didn't do optimization #4 since it's a bit more complicated and will only really help with files that are several blocks long. I'd like to leave it for future work. The patch is against the HA branch, since the branch is soon to be merged and the code around OP_ADD, etc, differs a bit. Rather than do it twice, I figured I'd just work on HA branch. I'll do a round of benchmarks on this patch tomorrow.

Eli Collins added a comment - 28/Feb/12 17:21

Is log size a proxy for performance? Ie do we have data to suggest the slowdown is due to the log size (# of bytes we're writing) vs just an increase in the # of transactions, which may increase eg the # flushes, or have you isolated for those already (eg there are no net new flushes) and it's down to just the # bytes going across the wire? I guess another way to phrase this is how much of the overhead in the ~~HDFS-3010~~ for real workloads is option #2 vs option #3.

Eli Collins added a comment - 28/Feb/12 17:21 Is log size a proxy for performance? Ie do we have data to suggest the slowdown is due to the log size (# of bytes we're writing) vs just an increase in the # of transactions, which may increase eg the # flushes, or have you isolated for those already (eg there are no net new flushes) and it's down to just the # bytes going across the wire? I guess another way to phrase this is how much of the overhead in the HDFS-3010 for real workloads is option #2 vs option #3.

Todd Lipcon added a comment - 28/Feb/12 17:28

I do think the log size is the major difference causing the performance issues in ~~HDFS-3010~~. In my initial benchmarks last night (which included only one of the above optimizations) the regression was only 7% vs the ~20% I had seen in earlier testing. But, these results aren't reliable since I also switched to a different cluster between benchmarks. I'll rerun the benchmarks today.

Also worth noting - you referred to "option #2 vs option #3" above, but really option #3 is combined with option #2 for extra savings (ie the OP_UPDATE_BLOCKS is around half to 1/3 the size because of the delta-vint-encoded genstamps and sizes)

Todd Lipcon added a comment - 28/Feb/12 17:28 I do think the log size is the major difference causing the performance issues in HDFS-3010 . In my initial benchmarks last night (which included only one of the above optimizations) the regression was only 7% vs the ~20% I had seen in earlier testing. But, these results aren't reliable since I also switched to a different cluster between benchmarks. I'll rerun the benchmarks today. Also worth noting - you referred to "option #2 vs option #3" above, but really option #3 is combined with option #2 for extra savings (ie the OP_UPDATE_BLOCKS is around half to 1/3 the size because of the delta-vint-encoded genstamps and sizes)

Eli Collins added a comment - 28/Feb/12 17:31

The patch looks good btw, though why does updating block info also now update mod time and access time?

Eli Collins added a comment - 28/Feb/12 17:31 The patch looks good btw, though why does updating block info also now update mod time and access time?

Eli Collins added a comment - 28/Feb/12 17:33

Thanks for the info. By option #2 vs option #3 above I meant component #2 vs component #3 in ~~HDFS-3010~~, ie is most of the overhead is due to component #3 (which this patch does not address).

Eli Collins added a comment - 28/Feb/12 17:33 Thanks for the info. By option #2 vs option #3 above I meant component #2 vs component #3 in HDFS-3010 , ie is most of the overhead is due to component #3 (which this patch does not address).

Eli Collins added a comment - 28/Feb/12 17:33

Thanks for the info. By option #2 vs option #3 above I meant component #2 vs component #3 in ~~HDFS-3010~~, ie is most of the overhead is due to component #3 (which this patch does not address).

Todd Lipcon added a comment - 28/Feb/12 17:33

though why does updating block info also now update mod time and access time

I just moved the following code:

-    // Update the salient file attributes.
-    file.setAccessTime(addCloseOp.atime);
-    file.setModificationTimeForce(addCloseOp.mtime);

since OP_ADD contains those pieces of info, but OP_UPDATE_BLOCKS doesn't.

Todd Lipcon added a comment - 28/Feb/12 17:33 though why does updating block info also now update mod time and access time I just moved the following code: - // Update the salient file attributes. - file.setAccessTime(addCloseOp.atime); - file.setModificationTimeForce(addCloseOp.mtime); since OP_ADD contains those pieces of info, but OP_UPDATE_BLOCKS doesn't.

Todd Lipcon added a comment - 28/Feb/12 17:36

By option #2 vs option #3 above I meant component #2 vs component #3 in ~~HDFS-3010~~, ie is most of the overhead is due to component #3 (which this patch does not address).

Ah, I understand now.

Unfortunately the fsync cost is a "fixed penalty" of sorts. But I think the fsync will cost less when the data is smaller, especially when logging to NFS. Granted there is a fixed cost for seek, but there isn't much we can do to optimize that, short of flash or BBU write cache – so everywhere else we can shave off some overhead we should take the opportunity.

The good thing about the sync cost is that it might harm the performance of individual writers, but since syncs are batched, total system throughput shouldn't go down any more than the individual writer throughput. Will continue to investigate with further benchmarking.

Todd Lipcon added a comment - 28/Feb/12 17:36 By option #2 vs option #3 above I meant component #2 vs component #3 in HDFS-3010 , ie is most of the overhead is due to component #3 (which this patch does not address). Ah, I understand now. Unfortunately the fsync cost is a "fixed penalty" of sorts. But I think the fsync will cost less when the data is smaller, especially when logging to NFS. Granted there is a fixed cost for seek, but there isn't much we can do to optimize that, short of flash or BBU write cache – so everywhere else we can shave off some overhead we should take the opportunity. The good thing about the sync cost is that it might harm the performance of individual writers, but since syncs are batched, total system throughput shouldn't go down any more than the individual writer throughput. Will continue to investigate with further benchmarking.

Eli Collins added a comment - 28/Feb/12 18:34

Thanks for the info wrt perf, makes sense. Will be interesting to see which component dominates.

Change wrt mod/access time makes sense now. Seems weird to set these on a close op, but that's independent of this change.

Wrt the TODO, could we possibly reach this case via seek then append (iirc that's disallowed)? I was thinking truncate might be another option but that will throw away the blocks so shouldn't get here. I agree that we should only need this for the last block, maybe keep the code as is but add an assert?

Eli Collins added a comment - 28/Feb/12 18:34 Thanks for the info wrt perf, makes sense. Will be interesting to see which component dominates. Change wrt mod/access time makes sense now. Seems weird to set these on a close op, but that's independent of this change. Wrt the TODO, could we possibly reach this case via seek then append (iirc that's disallowed)? I was thinking truncate might be another option but that will throw away the blocks so shouldn't get here. I agree that we should only need this for the last block, maybe keep the code as is but add an assert?

Todd Lipcon added a comment - 28/Feb/12 18:41

I'll file a follow-up JIRA about that TODO I added. I think the scenario could be seen if you write several blocks with an earlier version of HDFS, then fsync() – you'll see an OP_ADD that adds a bunch of blocks at once, only the last of which should be marked UnderConstruction.

Todd Lipcon added a comment - 28/Feb/12 18:41 I'll file a follow-up JIRA about that TODO I added. I think the scenario could be seen if you write several blocks with an earlier version of HDFS, then fsync() – you'll see an OP_ADD that adds a bunch of blocks at once, only the last of which should be marked UnderConstruction.

Todd Lipcon added a comment - 29/Feb/12 01:27

I've been running this patch on a cluster. With this patch combined with ~~HDFS-3020~~ and HDFS-3025, the performance of the 4MB-block teragen is very close to trunk's (where we don't log persistBlocks at all). I've also tested plenty of restarts to verify the loading code.

Todd Lipcon added a comment - 29/Feb/12 01:27 I've been running this patch on a cluster. With this patch combined with HDFS-3020 and HDFS-3025 , the performance of the 4MB-block teragen is very close to trunk's (where we don't log persistBlocks at all). I've also tested plenty of restarts to verify the loading code.

Eli Collins added a comment - 29/Feb/12 22:30

Hey Todd, reviewed the patch again. +1, thanks for all the good benchmarking.

Eli Collins added a comment - 29/Feb/12 22:30 Hey Todd, reviewed the patch again. +1, thanks for all the good benchmarking.

Todd Lipcon added a comment - 01/Mar/12 00:21

Previous rev conflicted with the patch on trunk which added toString() to all the ops. New rev adds toString(). Since it's a trivial change, I'll commit this based on the prior +1.

Todd Lipcon added a comment - 01/Mar/12 00:21 Previous rev conflicted with the patch on trunk which added toString() to all the ops. New rev adds toString(). Since it's a trivial change, I'll commit this based on the prior +1.

Todd Lipcon added a comment - 01/Mar/12 00:37

Committed to HA branch.

Todd Lipcon added a comment - 01/Mar/12 00:37 Committed to HA branch.

Hudson added a comment - 01/Mar/12 12:57

Integrated in Hadoop-Hdfs-HAbranch-build #93 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/93/)
~~HDFS-3023~~. Optimize entries in edits log for persistBlocks call. Contributed by Todd Lipcon. (Revision 1295356)

Result = UNSTABLE
todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295356
Files :

/hadoop/common/branches/~~HDFS-1623~~/hadoop-hdfs-project/hadoop-hdfs/CHANGES.~~HDFS-1623~~.txt
/hadoop/common/branches/~~HDFS-1623~~/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LayoutVersion.java
/hadoop/common/branches/~~HDFS-1623~~/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java
/hadoop/common/branches/~~HDFS-1623~~/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
/hadoop/common/branches/~~HDFS-1623~~/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java
/hadoop/common/branches/~~HDFS-1623~~/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java
/hadoop/common/branches/~~HDFS-1623~~/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOpCodes.java
/hadoop/common/branches/~~HDFS-1623~~/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageSerialization.java
/hadoop/common/branches/~~HDFS-1623~~/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/EditsElement.java
/hadoop/common/branches/~~HDFS-1623~~/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/EditsLoaderCurrent.java
/hadoop/common/branches/~~HDFS-1623~~/hadoop-hdfs-project/hadoop-hdfs/src/test/resources/editsStored
/hadoop/common/branches/~~HDFS-1623~~/hadoop-hdfs-project/hadoop-hdfs/src/test/resources/editsStored.xml

Hudson added a comment - 01/Mar/12 12:57 Integrated in Hadoop-Hdfs-HAbranch-build #93 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/93/ ) HDFS-3023 . Optimize entries in edits log for persistBlocks call. Contributed by Todd Lipcon. (Revision 1295356) Result = UNSTABLE todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295356 Files : /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/CHANGES. HDFS-1623 .txt /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LayoutVersion.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOpCodes.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageSerialization.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/EditsElement.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/EditsLoaderCurrent.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/test/resources/editsStored /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/test/resources/editsStored.xml

Todd Lipcon added a comment - 01/Mar/12 18:21

Accidentally missed one place to add the new LV:

diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java
index 8960cbc..fdc9892 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java
@@ -122,7 +122,8 @@ class ImageLoaderCurrent implements ImageLoader {
   protected final DateFormat dateFormat = 
                                       new SimpleDateFormat("yyyy-MM-dd HH:mm");
   private static int[] versions = { -16, -17, -18, -19, -20, -21, -22, -23,
-      -24, -25, -26, -27, -28, -30, -31, -32, -33, -34, -35, -36, -37, -38, -39};
+      -24, -25, -26, -27, -28, -30, -31, -32, -33, -34, -35, -36, -37, -38, -39,
+      -40};
   private int imageVersion = 0;
 
   /* (non-Javadoc)

Cool if I just commit this delta under the same JIRA?

Todd Lipcon added a comment - 01/Mar/12 18:21 Accidentally missed one place to add the new LV: diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java index 8960cbc..fdc9892 100644 --- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java +++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java @@ -122,7 +122,8 @@ class ImageLoaderCurrent implements ImageLoader { protected final DateFormat dateFormat = new SimpleDateFormat( "yyyy-MM-dd HH:mm" ); private static int [] versions = { -16, -17, -18, -19, -20, -21, -22, -23, - -24, -25, -26, -27, -28, -30, -31, -32, -33, -34, -35, -36, -37, -38, -39}; + -24, -25, -26, -27, -28, -30, -31, -32, -33, -34, -35, -36, -37, -38, -39, + -40}; private int imageVersion = 0; /* (non-Javadoc) Cool if I just commit this delta under the same JIRA?

Aaron Myers added a comment - 01/Mar/12 18:22

+1 for the delta.

Aaron Myers added a comment - 01/Mar/12 18:22 +1 for the delta.

Todd Lipcon added a comment - 01/Mar/12 18:23

Committed the delta (r1295721)

Todd Lipcon added a comment - 01/Mar/12 18:23 Committed the delta (r1295721)

Hudson added a comment - 01/Mar/12 20:28

Integrated in Hadoop-Hdfs-HAbranch-build #94 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/94/)
Amend ~~HDFS-3023~~. Add new layout version number to ImageLoaderCurrent to fix TestOfflineImageViewer. (Revision 1295721)

Result = UNSTABLE
todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295721
Files :

/hadoop/common/branches/~~HDFS-1623~~/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java

Hudson added a comment - 01/Mar/12 20:28 Integrated in Hadoop-Hdfs-HAbranch-build #94 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/94/ ) Amend HDFS-3023 . Add new layout version number to ImageLoaderCurrent to fix TestOfflineImageViewer. (Revision 1295721) Result = UNSTABLE todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295721 Files : /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/ImageLoaderCurrent.java

People

Assignee:: Todd Lipcon

Reporter:: Todd Lipcon

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 28/Feb/12 04:57

Updated:: 01/Mar/12 20:28

Resolved:: 01/Mar/12 00:37

Hadoop HDFS

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates