Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
None
Description
I believe RATIS-2045 introduced a regression. Many Ozone integration tests fail once this commit is included, probably because nodes can't be added to a Ratis ring that has no log entries.
Sample run: https://github.com/duongkame/ozone/actions/runs/8623155740/job/23636444671
Sample failed test: TestAddRemoveOzoneManager.testBootstrap
Before the commit, it seems that new nodes did not have to install a snapshot (the installSnapshot check returned ALREADY_INSTALLED):
2024-04-10 17:31:36,897 [grpc-default-executor-2] INFO ratis.OzoneManagerStateMachine (OzoneManagerStateMachine.java:notifyConfigurationChanged(212)) - Received Configuration change notification from Ratis. New Peer list: [id: "omNode-1" address: "localhost:15015" startupRole: FOLLOWER ]
2024-04-10 17:31:36,905 [grpc-default-executor-2] INFO ratis.OzoneManagerRatisServer (OzoneManagerRatisServer.java:addRaftPeer(434)) - Added OM omNode-1 to Ratis Peers list.
2024-04-10 17:31:36,906 [grpc-default-executor-2] INFO om.OzoneManager (OzoneManager.java:addOMNodeToPeers(2042)) - Added OM omNode-1 to the Peer list.
2024-04-10 17:31:36,909 [grpc-default-executor-2] INFO impl.SnapshotInstallationHandler (SnapshotInstallationHandler.java:installSnapshot(103)) - omNode-bootstrap-1@group-0AAC5367B30E: reply installSnapshot: omNode-1<-omNode-bootstrap-1#0:OK-t0,ALREADY_INSTALLED,snapshotIndex=0
2024-04-10 17:31:36,922 [grpc-default-executor-2] INFO server.GrpcServerProtocolService (GrpcServerProtocolService.java:onCompleted(200)) - omNode-bootstrap-1: Completed INSTALL_SNAPSHOT, lastRequest: omNode-1->omNode-bootstrap-1#0-t1,notify:(t:1, i:0)
2024-04-10 17:31:36,923 [grpc-default-executor-2] INFO server.GrpcServerProtocolService (GrpcServerProtocolService.java:lambda$onCompleted$7(202)) - omNode-bootstrap-1: Completed INSTALL_SNAPSHOT, lastReply: null
2024-04-10 17:31:36,924 [grpc-default-executor-0] INFO server.GrpcLogAppender (GrpcLogAppender.java:onNext(658)) - omNode-1@group-0AAC5367B30E->omNode-bootstrap-1-InstallSnapshotResponseHandler: received the first reply omNode-1<-omNode-bootstrap-1#0:OK-t0,ALREADY_INSTALLED,snapshotIndex=0
2024-04-10 17:31:36,929 [grpc-default-executor-0] INFO server.GrpcLogAppender (GrpcLogAppender.java:onNext(679)) - omNode-1@group-0AAC5367B30E->omNode-bootstrap-1-InstallSnapshotResponseHandler: Follower snapshot is already at index 0.
2024-04-10 17:31:36,930 [grpc-default-executor-0] INFO leader.FollowerInfo (FollowerInfoImpl.java:info(64)) - omNode-1@group-0AAC5367B30E->omNode-bootstrap-1: matchIndex: setUnconditionally -1 -> 0
2024-04-10 17:31:36,930 [grpc-default-executor-0] INFO leader.FollowerInfo (FollowerInfoImpl.java:info(64)) - omNode-1@group-0AAC5367B30E->omNode-bootstrap-1: nextIndex: setUnconditionally 0 -> 1
After the commit, a new node seems to have to install a snapshot (an empty one):
2024-04-10 17:46:58,830 [grpc-default-executor-8] INFO ratis.OzoneManagerStateMachine (OzoneManagerStateMachine.java:notifyConfigurationChanged(212)) - Received Configuration change notification from Ratis. New Peer list: [id: "omNode-1" address: "localhost:15015" startupRole: FOLLOWER ]
2024-04-10 17:46:58,842 [grpc-default-executor-8] INFO ratis.OzoneManagerRatisServer (OzoneManagerRatisServer.java:addRaftPeer(434)) - Added OM omNode-1 to Ratis Peers list.
2024-04-10 17:46:58,842 [grpc-default-executor-8] INFO om.OzoneManager (OzoneManager.java:addOMNodeToPeers(2042)) - Added OM omNode-1 to the Peer list.
2024-04-10 17:46:58,847 [grpc-default-executor-8] INFO impl.SnapshotInstallationHandler (SnapshotInstallationHandler.java:installSnapshot(103)) - omNode-bootstrap-1@group-0AAC5367B30E: reply installSnapshot: omNode-1<-omNode-bootstrap-1#0:FAIL-t0,IN_PROGRESS,snapshotIndex=0
2024-04-10 17:46:58,862 [omNode-bootstrap-1-InstallSnapshotThread] INFO ratis_snapshot.OmRatisSnapshotProvider (OmRatisSnapshotProvider.java:downloadSnapshot(146)) - Downloading latest checkpoint from Leader OM omNode-1. Checkpoint URL: http://127.0.0.1:15013/dbCheckpoint?includeSnapshotData=true&flushBeforeCheckpoint=true
2024-04-10 17:46:58,870 [grpc-default-executor-8] INFO server.GrpcServerProtocolService (GrpcServerProtocolService.java:onCompleted(200)) - omNode-bootstrap-1: Completed INSTALL_SNAPSHOT, lastRequest: omNode-1->omNode-bootstrap-1#0-t1,notify:(t:1, i:0)
2024-04-10 17:46:58,871 [grpc-default-executor-7] INFO server.GrpcLogAppender (GrpcLogAppender.java:onNext(658)) - omNode-1@group-0AAC5367B30E->omNode-bootstrap-1-InstallSnapshotResponseHandler: received the first reply omNode-1<-omNode-bootstrap-1#0:FAIL-t0,IN_PROGRESS,snapshotIndex=0
2024-04-10 17:46:58,875 [grpc-default-executor-7] INFO server.GrpcLogAppender (GrpcLogAppender.java:onNext(674)) - omNode-1@group-0AAC5367B30E->omNode-bootstrap-1-InstallSnapshotResponseHandler: InstallSnapshot in progress.
2024-04-10 17:46:58,876 [grpc-default-executor-8] INFO server.GrpcServerProtocolService (GrpcServerProtocolService.java:lambda$onCompleted$7(202)) - omNode-bootstrap-1: Completed INSTALL_SNAPSHOT, lastReply: null
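The difference between the two log excerpts can be pictured with a small model. This is a simplified sketch, not the Ratis source: the class, enum, and method names are hypothetical. The follower snapshot index of -1 and the first available log index of 0 are taken from the RATIS-2045 title and from the notify:(t:1, i:0) request in the logs. The sketch assumes that the pre-commit check treated snapshotIndex + 1 >= firstAvailableLogIndex as "already installed", and that the post-commit check additionally requires a valid snapshot index.

```java
// Hypothetical model of the follower-side decision; NOT the actual Ratis code.
final class InstallSnapshotDecisionSketch {
  // Assumption: -1 marks "no snapshot yet", as in the RATIS-2045 title.
  static final long INVALID_LOG_INDEX = -1L;

  enum Reply { ALREADY_INSTALLED, IN_PROGRESS }

  // Assumed pre-commit behaviour: a follower whose snapshot is at or just
  // behind the leader's first available log index skips the installation.
  static Reply before(long snapshotIndex, long firstAvailableLogIndex) {
    return snapshotIndex + 1 >= firstAvailableLogIndex
        ? Reply.ALREADY_INSTALLED
        : Reply.IN_PROGRESS;
  }

  // Assumed post-commit behaviour: a follower without any snapshot is always
  // asked to install one, even if the leader's log starts at index 0.
  static Reply after(long snapshotIndex, long firstAvailableLogIndex) {
    if (snapshotIndex == INVALID_LOG_INDEX) {
      return Reply.IN_PROGRESS;
    }
    return before(snapshotIndex, firstAvailableLogIndex);
  }

  public static void main(String[] args) {
    // Bootstrap case from this report: empty follower, leader log starts at 0.
    System.out.println("before: " + before(-1L, 0L)); // prints "before: ALREADY_INSTALLED"
    System.out.println("after:  " + after(-1L, 0L));  // prints "after:  IN_PROGRESS"
  }
}
```

Under these assumptions, only the bootstrap case (snapshot index -1, first available log index 0) changes outcome, which matches the switch from ALREADY_INSTALLED to IN_PROGRESS seen in the logs.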
This snapshot installation seems to always fail because no checkpoint is created (because there are no logs?):
2024-04-10 18:13:10,601 [qtp1588976146-447] ERROR utils.DBCheckpointServlet (DBCheckpointServlet.java:generateSnapshotCheckpoint(238)) - Unable to process metadata snapshot request.
java.nio.file.NoSuchFileException: /Users/duong/workspaces/secondary/ozone2/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-59c4c0e4-c6ce-49e7-a03a-2f973a460919/ozone-meta/db.snapshots
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:407)
    at java.nio.file.Files.newDirectoryStream(Files.java:457)
    at java.nio.file.Files.list(Files.java:3451)
    at org.apache.hadoop.ozone.om.OMDBCheckpointServlet.processDir(OMDBCheckpointServlet.java:390)
    at org.apache.hadoop.ozone.om.OMDBCheckpointServlet.getFilesForArchive(OMDBCheckpointServlet.java:322)
    at org.apache.hadoop.ozone.om.OMDBCheckpointServlet.writeDbDataToStream(OMDBCheckpointServlet.java:172)
    at org.apache.hadoop.hdds.utils.DBCheckpointServlet.generateSnapshotCheckpoint(DBCheckpointServlet.java:220)
    at org.apache.hadoop.hdds.utils.DBCheckpointServlet.doPost(DBCheckpointServlet.java:321)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:523)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:590)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
    at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1656)
    at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:110)
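The top of the stack trace shows java.nio.file.Files.list being called on a db.snapshots directory that does not exist. The standalone sketch below reproduces only that JDK behaviour and shows one possible guard. The class and method names are hypothetical; it is not the OMDBCheckpointServlet code, and whether treating a missing directory as empty would be the right behaviour for Ozone is an assumption, not something this report establishes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration; NOT the Ozone servlet implementation.
final class MissingDirListingSketch {

  // Lists a directory without checking that it exists first.
  // Files.list throws NoSuchFileException when the directory is missing.
  static List<Path> listUnguarded(Path dir) throws IOException {
    try (Stream<Path> files = Files.list(dir)) {
      return files.collect(Collectors.toList());
    }
  }

  // Assumed alternative: treat a missing directory as having no entries.
  static List<Path> listIfPresent(Path dir) throws IOException {
    if (!Files.isDirectory(dir)) {
      return Collections.emptyList();
    }
    return listUnguarded(dir);
  }

  public static void main(String[] args) throws IOException {
    // A path that mimics a metadata dir in which db.snapshots was never created.
    Path missing = Files.createTempDirectory("ozone-meta").resolve("db.snapshots");
    try {
      listUnguarded(missing);
    } catch (NoSuchFileException e) {
      System.out.println("unguarded listing failed: " + e.getFile());
    }
    System.out.println("guarded listing entries: " + listIfPresent(missing).size());
  }
}
```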
Issue Links
- is caused by
  - RATIS-2045 SnapshotInstallationHandler doesn't notify follower when snapshotIndex is -1 and firstAvailableLogIndex is 0 (Resolved)
- is related to
  - HDDS-6077 Intermittent assertion failure in TestOzoneManagerBootstrap#testBootstrapWithoutConfigUpdate (Resolved)
  - HDDS-10924 TestSCMHAManagerImpl#testAddSCM fails on ratis master (Resolved)
- links to