Details
Sub-task
Status: Resolved
Blocker
Resolution: Fixed
None
Description
When lease recovery is run on multiple files during a rolling restart, an Ozone Manager (OM) terminates abruptly after the OMs are restarted.
2024-03-31 09:47:01,866 ERROR [om72-OMStateMachineApplyTransactionThread - 0]-org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Terminating with exit status 1: Request cmdType: RecoverLease traceID: "" clientId: "client-433C04E5C8CC" userInfo { userName: "hdfs@XYZ" remoteAddress: "xx.yy.ww.zz" hostName: "vb1307.xyz.com" } version: 3 layoutVersion { version: 6 } RecoverLeaseRequest { volumeName: "hsyncvol" bucketName: "hsyncbuck" keyName: "hsync/File_24.txt" force: false } failed with exception java.lang.NullPointerException: SecretKey client must have been initialized already.
    at java.util.Objects.requireNonNull(Objects.java:228)
    at org.apache.hadoop.hdds.security.symmetric.DefaultSecretKeySignerClient.getCurrentSecretKey(DefaultSecretKeySignerClient.java:70)
    at org.apache.hadoop.hdds.security.token.ShortLivedTokenSecretManager.createPassword(ShortLivedTokenSecretManager.java:47)
    at org.apache.hadoop.hdds.security.token.OzoneBlockTokenSecretManager.generateToken(OzoneBlockTokenSecretManager.java:70)
    at org.apache.hadoop.ozone.om.request.file.OMRecoverLeaseRequest.updateBlockInfo(OMRecoverLeaseRequest.java:281)
    at org.apache.hadoop.ozone.om.request.file.OMRecoverLeaseRequest.doWork(OMRecoverLeaseRequest.java:264)
    at org.apache.hadoop.ozone.om.request.file.OMRecoverLeaseRequest.validateAndUpdateCache(OMRecoverLeaseRequest.java:156)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.lambda$0(OzoneManagerRequestHandler.java:406)
    at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequestImpl(OzoneManagerRequestHandler.java:404)
    at org.apache.hadoop.ozone.protocolPB.RequestHandler.handleWriteRequest(RequestHandler.java:63)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:525)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:343)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
I have seen this two or three times. This time I was able to reproduce it while lease recovery was running during the rolling restart (RR) phase.
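The stack trace above ends in an Objects.requireNonNull check inside getCurrentSecretKey, called while a RecoverLease request is being applied. One possible reading, offered here as an assumption and not a confirmed root cause, is that the request is applied before the secret key client has been started on the restarted OM. The sketch below illustrates only that ordering pattern. SketchSignerClient, applyRecoverLease, and the "key-1" value are hypothetical names for illustration and are not the Ozone classes or their API.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;

public class SecretKeyInitRaceSketch {

    /** Hypothetical stand-in for a signer client whose key is only set by start(). */
    static final class SketchSignerClient {
        private final AtomicReference<String> currentKey = new AtomicReference<>();

        /** Simulates the service start-up step that loads the current key. */
        void start() {
            currentKey.set("key-1");
        }

        /** Mirrors the null check seen at the top of the stack trace. */
        String getCurrentSecretKey() {
            return Objects.requireNonNull(currentKey.get(),
                "SecretKey client must have been initialized already.");
        }
    }

    /** Hypothetical apply step: needs the key to sign a block token. */
    static String applyRecoverLease(SketchSignerClient client) {
        try {
            return "token signed with " + client.getCurrentSecretKey();
        } catch (NullPointerException e) {
            // The real OM terminates here; the sketch only reports the failure.
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        SketchSignerClient client = new SketchSignerClient();
        // Request applied before start(): the null check fires.
        System.out.println(applyRecoverLease(client));
        client.start();
        // Same request after start(): the key is available.
        System.out.println(applyRecoverLease(client));
    }
}
```

In this sketch the first call fails and the second succeeds, which shows why the order of key initialization and request application would matter if the assumption holds.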