Details
Description
The namenode has an FSN lock and an FSD lock. Most namenode operations need to take the FSN lock first and then the FSD lock. The permission check is done via FSPermissionChecker at the FSD layer, assuming the FSN lock is already held.
The FSPermissionChecker constructor invokes callerUgi.getGroups(), which can sometimes take seconds. There are external caching schemes such as SSSD and an internal caching scheme for group lookup. However, the delay can still occur during a cache refresh, which causes severe FSN lock contention and an unresponsive namenode.
Checking the current code, we found that getBlockLocations(..) does this correctly, but some methods such as getFileInfo(..) and getContentSummary(..) do not. This ticket is opened to ensure that the group lookup for the permission checker happens outside the FSN lock.
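To make the ordering concrete, here is a minimal sketch of the two patterns described above. It is not Hadoop code: LockOrderSketch, PermissionChecker, lookupGroups and the two getFileInfo variants are hypothetical stand-ins, with a ReentrantReadWriteLock playing the role of the FSN lock and a trivial method playing the role of callerUgi.getGroups().

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for the namenode; illustrates lock ordering only. */
class LockOrderSketch {
  // Stand-in for the FSN lock.
  private final ReentrantReadWriteLock fsnLock = new ReentrantReadWriteLock();
  // Set to true if the group lookup ever runs while this thread holds the lock.
  private boolean lookupRanUnderLock = false;

  /** Stand-in for callerUgi.getGroups(); the real call may block for seconds. */
  private List<String> lookupGroups(String user) {
    if (fsnLock.getReadHoldCount() > 0 || fsnLock.isWriteLockedByCurrentThread()) {
      lookupRanUnderLock = true;
    }
    return Arrays.asList(user + "-group");
  }

  /** Stand-in for FSPermissionChecker: resolves groups in its constructor. */
  final class PermissionChecker {
    final String user;
    final List<String> groups;

    PermissionChecker(String user) {
      this.user = user;
      this.groups = lookupGroups(user);
    }
  }

  /** Pattern to avoid: the checker (and group lookup) is created under the lock. */
  String getFileInfoLookupInsideLock(String user, String path) {
    fsnLock.readLock().lock();
    try {
      PermissionChecker pc = new PermissionChecker(user);
      return describe(pc, path);
    } finally {
      fsnLock.readLock().unlock();
    }
  }

  /** Pattern the ticket asks for: create the checker first, then take the lock. */
  String getFileInfoLookupOutsideLock(String user, String path) {
    PermissionChecker pc = new PermissionChecker(user);
    fsnLock.readLock().lock();
    try {
      return describe(pc, path);
    } finally {
      fsnLock.readLock().unlock();
    }
  }

  private String describe(PermissionChecker pc, String path) {
    return path + " checked for " + pc.user + " in " + pc.groups;
  }

  boolean lookupRanUnderLock() {
    return lookupRanUnderLock;
  }

  public static void main(String[] args) {
    LockOrderSketch inside = new LockOrderSketch();
    inside.getFileInfoLookupInsideLock("alice", "/tmp/f");
    System.out.println("inside-lock variant looked up under lock: "
        + inside.lookupRanUnderLock());

    LockOrderSketch outside = new LockOrderSketch();
    outside.getFileInfoLookupOutsideLock("alice", "/tmp/f");
    System.out.println("outside-lock variant looked up under lock: "
        + outside.lookupRanUnderLock());
  }
}
```

In the second variant a slow group lookup delays only the calling thread, since no other operation is waiting on a lock held during the lookup.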
Attachments
- HDFS-13136.001.patch
- 67 kB
- Xiaoyu Yao
- HDFS-13136.002.patch
- 69 kB
- Xiaoyu Yao
- HDFS-13136-branch-3.0.001.patch
- 72 kB
- Xiaoyu Yao
- HDFS-13136-branch-3.0.002.patch
- 73 kB
- Xiaoyu Yao
Activity
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 34s | Docker mode activated. |
Prechecks | |||
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
trunk Compile Tests | |||
+1 | mvninstall | 20m 56s | trunk passed |
+1 | compile | 1m 2s | trunk passed |
+1 | checkstyle | 0m 45s | trunk passed |
+1 | mvnsite | 1m 2s | trunk passed |
+1 | shadedclient | 11m 51s | branch has no errors when building and testing our client artifacts. |
+1 | findbugs | 1m 56s | trunk passed |
+1 | javadoc | 0m 55s | trunk passed |
Patch Compile Tests | |||
+1 | mvninstall | 0m 58s | the patch passed |
+1 | compile | 0m 50s | the patch passed |
+1 | javac | 0m 50s | the patch passed |
+1 | checkstyle | 0m 38s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 252 unchanged - 1 fixed = 252 total (was 253) |
+1 | mvnsite | 0m 56s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | shadedclient | 10m 59s | patch has no errors when building and testing our client artifacts. |
+1 | findbugs | 1m 59s | the patch passed |
+1 | javadoc | 0m 52s | the patch passed |
Other Tests | |||
-1 | unit | 124m 12s | hadoop-hdfs in the patch failed. |
+1 | asflicense | 0m 21s | The patch does not generate ASF License warnings. |
180m 29s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.hdfs.server.namenode.TestNameNodeMetadataConsistency |
 | hadoop.hdfs.server.namenode.TestAuditLogger |
 | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting |
 | hadoop.hdfs.server.namenode.TestAuditLoggerWithCommands |
Subsystem | Report/Notes |
---|---|
Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12910302/HDFS-13136.001.patch |
Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
uname | Linux df32cca368ce 3.13.0-135-generic #184-Ubuntu SMP Wed Oct 18 11:55:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/patchprocess/precommit/personality/provided.sh |
git revision | trunk / 5a1db60 |
maven | version: Apache Maven 3.3.9 |
Default Java | 1.8.0_151 |
findbugs | v3.1.0-RC1 |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/23039/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/23039/testReport/ |
Max. process+thread count | 3115 (vs. ulimit of 5500) |
modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/23039/console |
Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
+1 the 001 patch looks good. Thanks for fixing all the methods.
The failed tests do not seem related. Please take a look.
Thanks szetszwo for the review. Updated patch v2 fixes the unit test failures in
hadoop.hdfs.server.namenode.TestAuditLogger and hadoop.hdfs.server.namenode.TestAuditLoggerWithCommands.
Now that getPermissionChecker() is moved out of the FSN lock, the test mocks are updated to reach deeper to get the expected exception and the audit log entry. The delta from v1 to v2 is the two unit test changes above. I could not reproduce the other two failures.
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 23s | Docker mode activated. |
Prechecks | |||
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
trunk Compile Tests | |||
+1 | mvninstall | 17m 31s | trunk passed |
+1 | compile | 1m 2s | trunk passed |
+1 | checkstyle | 0m 44s | trunk passed |
+1 | mvnsite | 1m 12s | trunk passed |
+1 | shadedclient | 12m 1s | branch has no errors when building and testing our client artifacts. |
+1 | findbugs | 1m 58s | trunk passed |
+1 | javadoc | 0m 56s | trunk passed |
Patch Compile Tests | |||
+1 | mvninstall | 1m 0s | the patch passed |
+1 | compile | 0m 56s | the patch passed |
+1 | javac | 0m 56s | the patch passed |
+1 | checkstyle | 0m 39s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 285 unchanged - 1 fixed = 285 total (was 286) |
+1 | mvnsite | 0m 56s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | shadedclient | 10m 56s | patch has no errors when building and testing our client artifacts. |
+1 | findbugs | 2m 1s | the patch passed |
+1 | javadoc | 0m 51s | the patch passed |
Other Tests | |||
-1 | unit | 124m 14s | hadoop-hdfs in the patch failed. |
+1 | asflicense | 0m 24s | The patch does not generate ASF License warnings. |
177m 16s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.hdfs.server.namenode.TestNameNodeMetadataConsistency |
 | hadoop.hdfs.TestHDFSFileSystemContract |
 | hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA |
 | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting |
Subsystem | Report/Notes |
---|---|
Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12910611/HDFS-13136.002.patch |
Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
uname | Linux 76750bd05515 3.13.0-135-generic #184-Ubuntu SMP Wed Oct 18 11:55:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/patchprocess/precommit/personality/provided.sh |
git revision | trunk / f20dc0d |
maven | version: Apache Maven 3.3.9 |
Default Java | 1.8.0_151 |
findbugs | v3.1.0-RC1 |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/23068/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/23068/testReport/ |
Max. process+thread count | 2894 (vs. ulimit of 5500) |
modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/23068/console |
Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
Thanks szetszwo for the review. I will commit the patch to trunk and branch-3.1 (clean cherry-pick) shortly. There are some conflicts on branch-3.0, so I just submitted a patch with the conflicts resolved for a Jenkins check.
SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13700 (See https://builds.apache.org/job/Hadoop-trunk-Commit/13700/)
HDFS-13136. Avoid taking FSN lock while doing group member lookup for (xyao: rev 84a1321f6aa0af6895564a7c47f8f264656f0294)
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirConcatOp.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestAuditLoggerWithCommands.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirXAttrOp.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirEncryptionZoneOp.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirDeleteOp.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EncryptionZoneManager.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestAuditLogger.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirAttrOp.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirMkdirOp.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirSnapshotOp.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirStatAndListingOp.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirRenameOp.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirAclOp.java
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 23s | Docker mode activated. |
Prechecks | |||
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
branch-3.0 Compile Tests | |||
+1 | mvninstall | 20m 55s | branch-3.0 passed |
+1 | compile | 0m 54s | branch-3.0 passed |
+1 | checkstyle | 0m 45s | branch-3.0 passed |
+1 | mvnsite | 1m 4s | branch-3.0 passed |
+1 | shadedclient | 11m 38s | branch has no errors when building and testing our client artifacts. |
+1 | findbugs | 1m 54s | branch-3.0 passed |
+1 | javadoc | 0m 56s | branch-3.0 passed |
Patch Compile Tests | |||
+1 | mvninstall | 0m 58s | the patch passed |
+1 | compile | 0m 50s | the patch passed |
+1 | javac | 0m 50s | the patch passed |
+1 | checkstyle | 0m 39s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 306 unchanged - 4 fixed = 306 total (was 310) |
+1 | mvnsite | 0m 55s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | shadedclient | 10m 57s | patch has no errors when building and testing our client artifacts. |
+1 | findbugs | 1m 59s | the patch passed |
+1 | javadoc | 0m 52s | the patch passed |
Other Tests | |||
-1 | unit | 89m 32s | hadoop-hdfs in the patch failed. |
+1 | asflicense | 0m 24s | The patch does not generate ASF License warnings. |
145m 21s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.hdfs.server.namenode.TestAuditLoggerWithCommands |
Subsystem | Report/Notes |
---|---|
Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:20ca677 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12911600/HDFS-13136-branch-3.0.001.patch |
Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
uname | Linux e19160d6a097 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/patchprocess/precommit/personality/provided.sh |
git revision | branch-3.0 / c2bbe22 |
maven | version: Apache Maven 3.3.9 |
Default Java | 1.8.0_151 |
findbugs | v3.1.0-RC1 |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/23153/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/23153/testReport/ |
Max. process+thread count | 3853 (vs. ulimit of 10000) |
modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/23153/console |
Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 26s | Docker mode activated. |
Prechecks | |||
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
branch-3.0 Compile Tests | |||
+1 | mvninstall | 14m 46s | branch-3.0 passed |
+1 | compile | 0m 50s | branch-3.0 passed |
+1 | checkstyle | 0m 41s | branch-3.0 passed |
+1 | mvnsite | 0m 52s | branch-3.0 passed |
+1 | shadedclient | 10m 30s | branch has no errors when building and testing our client artifacts. |
+1 | findbugs | 1m 47s | branch-3.0 passed |
+1 | javadoc | 0m 52s | branch-3.0 passed |
Patch Compile Tests | |||
+1 | mvninstall | 0m 53s | the patch passed |
+1 | compile | 0m 47s | the patch passed |
+1 | javac | 0m 47s | the patch passed |
+1 | checkstyle | 0m 36s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 306 unchanged - 4 fixed = 306 total (was 310) |
+1 | mvnsite | 0m 50s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | shadedclient | 9m 24s | patch has no errors when building and testing our client artifacts. |
+1 | findbugs | 1m 50s | the patch passed |
+1 | javadoc | 0m 52s | the patch passed |
Other Tests | |||
-1 | unit | 127m 38s | hadoop-hdfs in the patch failed. |
+1 | asflicense | 0m 21s | The patch does not generate ASF License warnings. |
173m 51s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure |
 | hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA |
 | hadoop.hdfs.web.TestWebHdfsTimeouts |
 | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure |
 | hadoop.hdfs.TestSafeModeWithStripedFileWithRandomECPolicy |
Subsystem | Report/Notes |
---|---|
Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:20ca677 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12911632/HDFS-13136-branch-3.0.002.patch |
Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
uname | Linux b4cb77402914 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/patchprocess/precommit/personality/provided.sh |
git revision | branch-3.0 / c2bbe22 |
maven | version: Apache Maven 3.3.9 |
Default Java | 1.8.0_151 |
findbugs | v3.1.0-RC1 |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/23159/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/23159/testReport/ |
Max. process+thread count | 4874 (vs. ulimit of 10000) |
modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/23159/console |
Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
Hey xyao, thanks for this work, it looks to be a good improvement. One question: do you know how significant the performance improvement will be when using the background cache refresh feature from HADOOP-13263? It seems to me that with this enabled, the only improvement would be for the first time a user is ever looked up (which should be very rare).
Agreed, HADOOP-11238 + HADOOP-13263 will definitely help with the group lookup performance issue if configured properly. Together with this fix, even the slow warm-up period (where the cache does not yet have an entry for a certain user) won't hold the FSN lock, which could otherwise trigger a failover.
Hi xyao,
Thanks for your work here, could it be Resolved since it's committed?
I saw it's in branch-3.0 which will target for 3.0.3.
Thanks.
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 0s | Docker mode activated. |
-1 | patch | 0m 7s | |
Subsystem | Report/Notes |
---|---|
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12911632/HDFS-13136-branch-3.0.002.patch |
Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/24163/console |
Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
yzhangal, thanks for the heads up. This has been committed to trunk and branch-3.1.
I saw it's in branch-3.0 which will target for 3.0.3.
The branch-3.0 patch has not been committed yet. I will need to rebase the patch and get a new Jenkins run before committing and resolving it.
Sorry, I was wrong. This is in branch-3.0. I will resolve the ticket. Thanks yzhangal.
Attached an initial patch that moves getPermissionChecker() out of the FSN lock. Thanks for the offline discussion with szetszwo.
This patch also removes the repeated group lookup from recursive calls such as FSDirStatAndListingOp#getContentSummaryInt(), which will help to improve NN performance.
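The effect of removing the repeated lookup can be sketched as follows. This is an illustration under stated assumptions, not the code in FSDirStatAndListingOp: ReuseCheckerSketch, Checker, Node and both summary methods are hypothetical, the counter stands in for the cost of a group lookup, and silently skipping unreadable subtrees is a simplification of the real permission failure path.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of building one checker per call tree. */
class ReuseCheckerSketch {
  // Counts how many times the (potentially slow) group lookup would run.
  static int groupLookups = 0;

  /** Stand-in for FSPermissionChecker; each construction costs one lookup. */
  static final class Checker {
    final String user;

    Checker(String user) {
      this.user = user;
      groupLookups++;
    }

    boolean canRead(Node n) {
      return n.owner.equals(user);
    }
  }

  /** Stand-in for an inode with a length and optional children. */
  static final class Node {
    final String owner;
    final long length;
    final List<Node> children = new ArrayList<>();

    Node(String owner, long length) {
      this.owner = owner;
      this.length = length;
    }

    Node add(Node child) {
      children.add(child);
      return this;
    }
  }

  /** Repeated lookup: a new checker is built at every level of the recursion. */
  static long summaryWithRepeatedLookup(String user, Node n) {
    Checker pc = new Checker(user);
    if (!pc.canRead(n)) {
      return 0; // simplification: skip unreadable subtrees
    }
    long total = n.length;
    for (Node c : n.children) {
      total += summaryWithRepeatedLookup(user, c);
    }
    return total;
  }

  /** Single lookup: the caller builds the checker once and passes it down. */
  static long summaryWithSharedChecker(Checker pc, Node n) {
    if (!pc.canRead(n)) {
      return 0; // simplification: skip unreadable subtrees
    }
    long total = n.length;
    for (Node c : n.children) {
      total += summaryWithSharedChecker(pc, c);
    }
    return total;
  }

  public static void main(String[] args) {
    Node root = new Node("alice", 0)
        .add(new Node("alice", 10))
        .add(new Node("alice", 20).add(new Node("alice", 5)));

    groupLookups = 0;
    long a = summaryWithRepeatedLookup("alice", root);
    System.out.println("repeated: total=" + a + " lookups=" + groupLookups);

    groupLookups = 0;
    long b = summaryWithSharedChecker(new Checker("alice"), root);
    System.out.println("shared:   total=" + b + " lookups=" + groupLookups);
  }
}
```

Both variants compute the same total; only the number of checker constructions differs, growing with the size of the tree in the first variant and staying at one in the second.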