Details
Description
Through we passed SYNC_FILE_RANGE_WRITE to sync_file_range, to make it as asynchronous as possible, it still could be blocked, e.g. the os io request queue is full.
Since we use sync_file_range just as a page cache advisor role it doesn't decide or guarantee the real durability, it would be nice if we could run it in backgroud. At least my test log showed, a few sync_file_range calls still cost tens of ms or more, due to the happened location is in the critical write path(BlockReceiver class), from a upper view, like HBase application, will "hung" tens of ms as well during Hlog syncing.
Generally speaking, the patch could not improve too much, but, better than before, right ?
Attachments
Attachments
- HDFS-6109-v4.txt
- 9 kB
- Liang Xie
- HDFS-6109-v3.txt
- 12 kB
- Liang Xie
- HDFS-6109-v2.txt
- 9 kB
- Liang Xie
- HDFS-6109.txt
- 9 kB
- Liang Xie
Activity
If the call to sync_file_range becomes asynchronous, then it can fall arbitrarily behind over time. This will get rid of one of the advantages of sync_file_range, which is that it makes performance more predictable. So I think if we do this, it should be optional.
yeh, nice suggestion, especially before i make patch, haha. let me add a option per your advice.
Attached is a straight-forward trunk patch, please help to review, thanks.
I introduced a config per comment and the default behivor keeps the same as before.
This change is only meaningful for latency sensitive application like HBase. Every write stall from datanode layer will let HBase WAL sync operation hung the similar time range at least.
I found this post from one fb mysql developer is really nice:
http://yoshinorimatsunobu.blogspot.hk/2014/03/how-syncfilerange-really-works.html
Especially the final "Solution for the stalls" section:
calling sync_file_range() from background threads (not from user-facing thread) are important
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12646767/HDFS-6109.txt
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6975//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6975//console
This message is automatically generated.
Thanks for the blog post pointer. Pertinent.
Nit: Swap ofer on this if/else + if (!syncBehindWritesInBackground) { It is harder to read NOT true.
Its a big of a pain running a pool of threads just to do sync_file_range (that is supposed to be able to run asynchronously – except it often doesn't as per the blog posting) but this is better than our having big pauses.
+1 on patch. This facility is off by default and doesn't cost when off. A test that verifies the executor runs and calls sync_file_range when this facility is enabled would be a little pedantic IMO. What do others think?
Nice one xieliang007
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12646833/HDFS-6109-v2.txt
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6979//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6979//console
This message is automatically generated.
Maybe we can reuse the FsDatasetAsyncDiskService available in FsDatasetImpl? Right now it's used to do async deletes, but it seems apt to do this sort of thing there too. It'd be good so we don't overload a single volume with different types of I/O work.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12647310/HDFS-6109-v3.txt
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7003//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7003//console
This message is automatically generated.
nit: Nothing wrong w/ the below but I had a bit of a visceral reaction to the duplicated fat call to syncFileRangeIfPossible. Any way of making it 'neater' (set a boolean if we need to call submitSyncFileRangeRequest?
+ if (syncBehindWritesInBackground) { + FsVolumeSpi volume = this.datanode.getFSDataset().getVolume(block); + if (volume instanceof FsVolumeImpl) { + FsDatasetAsyncDiskService asyncDiskService = this.datanode + .getFSDataset().getFsDatasetAsyncDiskService(); + asyncDiskService.submitSyncFileRangeRequest( + (FsVolumeImpl) volume, outFd, lastCacheManagementOffset, + offsetInBlock - lastCacheManagementOffset, + NativeIO.POSIX.SYNC_FILE_RANGE_WRITE); + } else { + LOG.warn("Fallback to foreground sync_file_range, unexpected volume type:" + + volume.getClass().getSimpleName()); + NativeIO.POSIX.syncFileRangeIfPossible(outFd, + lastCacheManagementOffset, offsetInBlock + - lastCacheManagementOffset, + NativeIO.POSIX.SYNC_FILE_RANGE_WRITE); + } + } else { + NativeIO.POSIX.syncFileRangeIfPossible(outFd, + lastCacheManagementOffset, + offsetInBlock - lastCacheManagementOffset, + NativeIO.POSIX.SYNC_FILE_RANGE_WRITE); + }
Maybe change the class comment on FsDatasetAsyncDiskService now it is no longer about deletes only?
Both of above are nits.
Patch lgtm.
Index: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeImpl.java =================================================================== --- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeImpl.java (revision 1598208) +++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeImpl.java (working copy) @@ -49,7 +49,7 @@ * It uses the {[~alanlink]FsDatasetImpl} object for synchronization. */ @InterfaceAudience.Private -class FsVolumeImpl implements FsVolumeSpi { +public class FsVolumeImpl implements FsVolumeSpi {
I think, rather than exposing the implementation classes and doing typecasts, we should just add a method to FsDatasetSpi like submitBackgroundSyncFileRangeRequest. Then FsDatasetImpl can add a method implementing this, and other implementations can choose to do something (or nothing) as they like.
Looks good aside from that. It would be nice to see some performance numbers.
Patch v4 should address Michael and Colin's comments.
for perf number, we could not expect it'll improve p99 or p99.9 latency too much, but i did observe a small part of the HBase WAL sync stalls were contributed by sync_file_range with adding more logging around it.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12647526/HDFS-6109-v4.txt
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7012//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7012//console
This message is automatically generated.
Patch lgtm (I liked the cmccabe suggestion that we keep FsDatasetImpl encapsulated).
Committed to branch-2 and to trunk. Thanks for the patch xieliang007
SUCCESS: Integrated in Hadoop-trunk-Commit #5643 (See https://builds.apache.org/job/Hadoop-trunk-Commit/5643/)
HDFS-6109 let sync_file_range() system call run in background (Liang Xie via stack) (stack: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1599347)
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DNConf.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/FsDatasetSpi.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetAsyncDiskService.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/SimulatedFSDataset.java
FAILURE: Integrated in Hadoop-Yarn-trunk #572 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/572/)
HDFS-6109 let sync_file_range() system call run in background (Liang Xie via stack) (stack: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1599347)
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DNConf.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/FsDatasetSpi.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetAsyncDiskService.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/SimulatedFSDataset.java
SUCCESS: Integrated in Hadoop-Hdfs-trunk #1763 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1763/)
HDFS-6109 let sync_file_range() system call run in background (Liang Xie via stack) (stack: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1599347)
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DNConf.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/FsDatasetSpi.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetAsyncDiskService.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/SimulatedFSDataset.java
FAILURE: Integrated in Hadoop-Mapreduce-trunk #1790 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1790/)
HDFS-6109 let sync_file_range() system call run in background (Liang Xie via stack) (stack: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1599347)
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DNConf.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/FsDatasetSpi.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetAsyncDiskService.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/SimulatedFSDataset.java
xieliang007 I was going over the impact of enabling the 'dfs.datanode.sync.behind.writes.in.background' flag and thought about conditions under sync_file_range with SYNC_FILE_RANGE_WRITE might block. The sync_file_range man page says exactly what you alluded to in the ticket description:
SYNC_FILE_RANGE_WRITE
Initiate write-out of all dirty pages in the specified range
which are not presently submitted write-out. Note that even
this may block if you attempt to write more than request queue
size.
Would u please elaborate on the conditions where sync_file_range was blocking. What were the memory, disk, number and size of the regions on the region server, etc. I want to understand the signs to look for before enabling the 'dfs.datanode.sync.behind.writes.in.background' flag.
Thanx,
- Young
I will upload this minor patch after HDFS-6108 done, i hate rebase...