Details
Description
Image transfer from the Standby NameNode to the Active NameNode silently fails on the Active, without any logging and without notifying the receiver side.
Attachments
- HDFS-15036.001.patch
- 3 kB
- Chen Liang
- HDFS-15036.002.patch
- 5 kB
- Chen Liang
- HDFS-15036.003.patch
- 5 kB
- Chen Liang
Issue Links
- duplicates
-
HDFS-15287 HDFS rollingupgrade prepare never finishes
- Resolved
- relates to
-
HDFS-15287 HDFS rollingupgrade prepare never finishes
- Resolved
Activity
I spent some time debugging this issue, and I think I found the cause.
In HDFS-12979, we introduced logic that rejects an image upload request if the image being uploaded is not far enough ahead of the previous image. This prevents the scenario where, with multiple SbNs, all of them upload images to the ANN too frequently. The rejection is considered correct behavior, so nothing is logged to indicate an error (the "silent" part). Both ANN and SbN simply ignore it and proceed.
But it now appears that a side effect of this change is that, during RU, the rollback image also has to go through this check and can also be rejected. If this happens, the SbN proceeds assuming the upload is done, while the ANN proceeds without having received the rollback image. The upload silently fails in this case.
The check that rejects the upload is in ImageServlet. In my earlier test, I just commented out the whole block below and the issue seemed to go away. But I think the fix is probably just to add a new check so that this rejection applies only to regular image uploads, not rollback images, like the newly added line in the following code snippet. I haven't actually tested changing it this way, though:
if (checkRecentImageEnable &&
    NameNodeFile.IMAGE.equals(parsedParams.getNameNodeFile()) && // <--- this should fix the issue, as NameNodeFile.IMAGE_ROLLBACK should bypass this
    timeDelta < checkpointPeriod &&
    txid - lastCheckpointTxid < checkpointTxnCount) {
  // only when at least one of two conditions are met we accept
  // a new fsImage
  // 1. most recent image's txid is too far behind
  // 2. last checkpoint time was too old
  response.sendError(HttpServletResponse.SC_CONFLICT,
      "Most recent checkpoint is neither too far behind in "
      + "txid, nor too old. New txnid cnt is "
      + (txid - lastCheckpointTxid)
      + ", expecting at least " + checkpointTxnCount
      + " unless too long since last upload.");
  return null;
}
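To make the effect of the extra condition concrete, here is a minimal standalone sketch of the rejection predicate. It is not the ImageServlet code: the class UploadCheck, the method shouldReject, and the two-value enum are hypothetical stand-ins, and the numbers in main are made up for illustration.

```java
/** Hypothetical stand-in for the upload check; not the actual ImageServlet code. */
final class UploadCheck {
  /** Only the two file kinds discussed here; this enum is an illustration. */
  enum NameNodeFile { IMAGE, IMAGE_ROLLBACK }

  /**
   * Returns true when the upload would be rejected.
   * With the added file-type condition, only regular checkpoint images
   * are subject to the "too recent" check.
   */
  static boolean shouldReject(boolean checkRecentImageEnable, NameNodeFile file,
      long timeDelta, long checkpointPeriod,
      long txid, long lastCheckpointTxid, long checkpointTxnCount) {
    return checkRecentImageEnable
        && file == NameNodeFile.IMAGE
        && timeDelta < checkpointPeriod
        && txid - lastCheckpointTxid < checkpointTxnCount;
  }

  public static void main(String[] args) {
    // Same timing and txid values for both calls; only the file kind differs.
    System.out.println(shouldReject(true, NameNodeFile.IMAGE,
        10, 3600, 1050, 1000, 1_000_000));
    System.out.println(shouldReject(true, NameNodeFile.IMAGE_ROLLBACK,
        10, 3600, 1050, 1000, 1_000_000));
  }
}
```

With identical timing and transaction values, the regular image is rejected while the rollback image passes, which is the behavior the proposed condition is meant to give.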
vagarychen, sorry for grabbing this JIRA too soon. Since you have done much study on this, do you want to take this JIRA instead?
Good investigation and findings vagarychen.
- Could you add a comment explaining that ImageServlet should not reject images other than checkpoints?
- I am still concerned about the "silent" part. Should we add some logging, so that next time we can see what happened on both nodes?
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 47s | Docker mode activated. |
Prechecks | |||
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
trunk Compile Tests | |||
+1 | mvninstall | 20m 23s | trunk passed |
+1 | compile | 1m 0s | trunk passed |
+1 | checkstyle | 0m 44s | trunk passed |
+1 | mvnsite | 1m 11s | trunk passed |
+1 | shadedclient | 14m 43s | branch has no errors when building and testing our client artifacts. |
-1 | findbugs | 2m 17s | hadoop-hdfs-project/hadoop-hdfs in trunk has 1 extant Findbugs warnings. |
+1 | javadoc | 1m 20s | trunk passed |
Patch Compile Tests | |||
+1 | mvninstall | 1m 10s | the patch passed |
+1 | compile | 0m 57s | the patch passed |
+1 | javac | 0m 57s | the patch passed |
+1 | checkstyle | 0m 39s | the patch passed |
+1 | mvnsite | 1m 0s | the patch passed |
+1 | whitespace | 0m 1s | The patch has no whitespace issues. |
+1 | shadedclient | 13m 31s | patch has no errors when building and testing our client artifacts. |
+1 | findbugs | 2m 20s | the patch passed |
+1 | javadoc | 1m 10s | the patch passed |
Other Tests | |||
-1 | unit | 99m 20s | hadoop-hdfs in the patch failed. |
+1 | asflicense | 0m 32s | The patch does not generate ASF License warnings. |
162m 55s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.hdfs.server.datanode.TestDataNodeErasureCodingMetrics |
 | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy |
 | hadoop.hdfs.server.namenode.TestNamenodeCapacityReport |
 | hadoop.hdfs.TestReconstructStripedFile |
 | hadoop.hdfs.server.namenode.TestFsck |
Subsystem | Report/Notes |
---|---|
Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:104ccca9169 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12988378/HDFS-15036.001.patch |
Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
uname | Linux 0e77d17e1e66 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/patchprocess/precommit/personality/provided.sh |
git revision | trunk / dc66de7 |
maven | version: Apache Maven 3.3.9 |
Default Java | 1.8.0_222 |
findbugs | v3.1.0-RC1 |
findbugs | https://builds.apache.org/job/PreCommit-HDFS-Build/28488/artifact/out/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28488/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28488/testReport/ |
Max. process+thread count | 2787 (vs. ulimit of 5500) |
modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/28488/console |
Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |
This message was automatically generated.
Thanks for taking a look, shv! Posted the v002 patch. The failed tests all passed in my local run.
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 43s | Docker mode activated. |
Prechecks | |||
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
trunk Compile Tests | |||
+1 | mvninstall | 21m 7s | trunk passed |
+1 | compile | 1m 9s | trunk passed |
+1 | checkstyle | 0m 46s | trunk passed |
+1 | mvnsite | 1m 15s | trunk passed |
+1 | shadedclient | 15m 17s | branch has no errors when building and testing our client artifacts. |
-1 | findbugs | 2m 34s | hadoop-hdfs-project/hadoop-hdfs in trunk has 1 extant Findbugs warnings. |
+1 | javadoc | 1m 23s | trunk passed |
Patch Compile Tests | |||
+1 | mvninstall | 1m 11s | the patch passed |
+1 | compile | 0m 58s | the patch passed |
+1 | javac | 0m 58s | the patch passed |
-0 | checkstyle | 0m 40s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 88 unchanged - 0 fixed = 89 total (was 88) |
+1 | mvnsite | 1m 8s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | shadedclient | 13m 50s | patch has no errors when building and testing our client artifacts. |
+1 | findbugs | 2m 29s | the patch passed |
+1 | javadoc | 1m 14s | the patch passed |
Other Tests | |||
-1 | unit | 103m 17s | hadoop-hdfs in the patch failed. |
+1 | asflicense | 0m 38s | The patch does not generate ASF License warnings. |
169m 24s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.hdfs.qjournal.client.TestQuorumJournalManager |
 | hadoop.hdfs.server.datanode.TestBPOfferService |
 | hadoop.hdfs.TestFileAppend2 |
 | hadoop.hdfs.server.namenode.TestFsck |
 | hadoop.hdfs.server.namenode.ha.TestDFSUpgradeWithHA |
 | hadoop.hdfs.qjournal.client.TestQJMWithFaults |
 | hadoop.hdfs.TestWriteRead |
Subsystem | Report/Notes |
---|---|
Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:104ccca9169 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12988468/HDFS-15036.002.patch |
Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
uname | Linux 21686e70fb56 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/patchprocess/precommit/personality/provided.sh |
git revision | trunk / 875a3e9 |
maven | version: Apache Maven 3.3.9 |
Default Java | 1.8.0_222 |
findbugs | v3.1.0-RC1 |
findbugs | https://builds.apache.org/job/PreCommit-HDFS-Build/28497/artifact/out/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html |
checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/28497/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28497/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28497/testReport/ |
Max. process+thread count | 2270 (vs. ulimit of 5500) |
modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/28497/console |
Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |
This message was automatically generated.
Looks good. Minor things
- Typo in doCheckpoint(). Remove "is" in: // by the other node. This could happen if
- Should use parameterized logging: LOG.info("Image upload rejected by the other NameNode: {}", uploadResult);
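As an illustration of the sender-side logging discussed here, the sketch below maps the receiver's HTTP status to a result and logs every non-success outcome with a parameterized message. Everything in it is an assumption made for the example: the class UploadResultLogging, the enum UploadResult and its values, and the status mapping are hypothetical, and java.util.logging (with its {0} placeholder) stands in for the project's logger so the sketch runs without extra dependencies.

```java
import java.net.HttpURLConnection;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sender-side helper; not the StandbyCheckpointer code. */
final class UploadResultLogging {
  private static final Logger LOG = Logger.getLogger("UploadResultLogging");

  /** Illustrative result type; names are assumptions for this sketch. */
  enum UploadResult { SUCCESS, REJECTED_RECENT_IMAGE, FAILED }

  /** Maps the HTTP status returned by the receiving NameNode to a result. */
  static UploadResult fromStatus(int status) {
    if (status == HttpURLConnection.HTTP_OK) {
      return UploadResult.SUCCESS;
    }
    if (status == HttpURLConnection.HTTP_CONFLICT) {
      return UploadResult.REJECTED_RECENT_IMAGE;
    }
    return UploadResult.FAILED;
  }

  /** Logs any non-success outcome; returns true only if the image was accepted. */
  static boolean recordUpload(int status) {
    UploadResult result = fromStatus(status);
    if (result != UploadResult.SUCCESS) {
      // Parameterized message: the argument is formatted only if INFO is enabled.
      LOG.log(Level.INFO,
          "Image upload rejected by the other NameNode: {0}", result);
      return false;
    }
    return true;
  }
}
```

The point of the sketch is only that the sender never treats a rejected upload as done without leaving a trace in its log.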
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 49s | Docker mode activated. |
Prechecks | |||
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 1s | The patch appears to include 1 new or modified test files. |
trunk Compile Tests | |||
+1 | mvninstall | 23m 35s | trunk passed |
+1 | compile | 1m 13s | trunk passed |
+1 | checkstyle | 1m 2s | trunk passed |
+1 | mvnsite | 1m 31s | trunk passed |
+1 | shadedclient | 17m 50s | branch has no errors when building and testing our client artifacts. |
-1 | findbugs | 2m 49s | hadoop-hdfs-project/hadoop-hdfs in trunk has 1 extant Findbugs warnings. |
+1 | javadoc | 1m 30s | trunk passed |
Patch Compile Tests | |||
+1 | mvninstall | 1m 15s | the patch passed |
+1 | compile | 1m 10s | the patch passed |
+1 | javac | 1m 10s | the patch passed |
-0 | checkstyle | 0m 41s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 88 unchanged - 0 fixed = 89 total (was 88) |
+1 | mvnsite | 1m 20s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | shadedclient | 15m 6s | patch has no errors when building and testing our client artifacts. |
+1 | findbugs | 2m 19s | the patch passed |
+1 | javadoc | 1m 9s | the patch passed |
Other Tests | |||
-1 | unit | 107m 35s | hadoop-hdfs in the patch failed. |
+1 | asflicense | 0m 36s | The patch does not generate ASF License warnings. |
180m 45s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.hdfs.server.namenode.TestFsck |
Subsystem | Report/Notes |
---|---|
Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:104ccca9169 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12988486/HDFS-15036.003.patch |
Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
uname | Linux 32b29ff6bfad 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/patchprocess/precommit/personality/provided.sh |
git revision | trunk / c2e9783 |
maven | version: Apache Maven 3.3.9 |
Default Java | 1.8.0_222 |
findbugs | v3.1.0-RC1 |
findbugs | https://builds.apache.org/job/PreCommit-HDFS-Build/28499/artifact/out/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html |
checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/28499/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28499/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28499/testReport/ |
Max. process+thread count | 3176 (vs. ulimit of 5500) |
modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/28499/console |
Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |
This message was automatically generated.
+1 on v03 patch.
TestFsck failure is tracked under HDFS-15038.
And the checkstyle warning is bogus.
SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17758 (See https://builds.apache.org/job/Hadoop-trunk-Commit/17758/)
HDFS-15036. Active NameNode should not silently fail the image transfer. (cliang: rev 65c4660bcd897e139fc175ca438cff75ec0c6be8)
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ImageServlet.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestRollingUpgrade.java
Thanks shv! I've committed to trunk and branch-2, will commit to branch-3.2 and branch-3.1 shortly as well.
vagarychen we should commit to branch-2.10. branch-2 was deleted as per discussion on hdfs-dev.
Oops! Did not realize it's already deleted, guess I missed the messages... will work on deleting it again...
Jim_Brennan, I filed https://issues.apache.org/jira/browse/INFRA-19581, but haven't gotten an update from the Infra folks yet.
This can happen during checkpointing or preparing for a rolling upgrade.
We observed it during a rolling upgrade, when the Standby was reporting "Rollback image has been created. Proceed to upgrade daemons." while the Active still reported "Rollback image has not been created."
In the logs for ANN I see that it started receiving the image:
But ANN did not print anything related to the image transfer afterwards. And the transferred image is missing in its storage directory.
The ANN log message comes from isValidRequestor() called by ImageServlet.doPut().
The SBN log indicates that the image was fully and successfully transferred to ANN:
The SBN log message comes from TransferFsImage.copyFileToStream().
Looking at the code in ImageServlet.doPut(), I see that one of the methods it calls is Util.receiveFile(). If an Exception is thrown inside the while-loop that reads from the input (socket) stream and writes to the output (image file) stream, it passes through a series of finally sections without being caught, so it is neither logged nor reported as an error to the sender.
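A minimal sketch of the alternative, assuming a simplified copy loop: the exception is logged where it occurs and then rethrown, so the caller can still report an error to the sender. The class ReceiveSketch and its copy method are hypothetical and are not the Util.receiveFile() implementation; java.util.logging is used only to keep the example dependency-free.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical simplified receiver; not the Util.receiveFile() code. */
final class ReceiveSketch {
  private static final Logger LOG = Logger.getLogger("ReceiveSketch");

  /** Copies in to out, logging and rethrowing any I/O failure. */
  static long copy(InputStream in, OutputStream out) throws IOException {
    long total = 0;
    try {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
        total += n;
      }
      return total;
    } catch (IOException e) {
      // Without this catch, the finally block below would run and the
      // failure would leave no trace in the receiver's log.
      LOG.log(Level.WARNING, "Image transfer failed after " + total + " bytes", e);
      throw e; // let the caller turn this into an error response
    } finally {
      out.close();
    }
  }
}
```

Rethrowing after logging keeps the cleanup in the finally block unchanged while giving the servlet a chance to send an error status instead of returning as if nothing happened.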
We should: