Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
ImportTSV fails when it is triggered during a rolling restart.
I debugged the issue, and it is reproducible every time ImportTSV is using reducers while the OM rolling restart stage is in progress.
2024-04-22 10:15:41,159|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|24/04/22 10:15:41 INFO mapreduce.Job: map 100% reduce 69%
2024-04-22 10:15:43,169|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|24/04/22 10:15:43 INFO mapreduce.Job: map 100% reduce 70%
2024-04-22 10:15:49,198|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|24/04/22 10:15:49 INFO mapreduce.Job: map 100% reduce 71%
2024-04-22 10:16:29,396|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|24/04/22 10:16:29 INFO mapreduce.Job: Task Id : attempt_1713778160624_0007_r_000072_0, Status : FAILED
2024-04-22 10:16:29,434|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|Error: org.apache.hadoop.security.token.SecretManager$InvalidToken: Tampered/Invalid token.
2024-04-22 10:16:29,434|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2024-04-22 10:16:29,434|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2024-04-22 10:16:29,434|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2024-04-22 10:16:29,435|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
2024-04-22 10:16:29,435|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
2024-04-22 10:16:29,435|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:110)
2024-04-22 10:16:29,435|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.ozone.client.OzoneClientFactory.getClientProtocol(OzoneClientFactory.java:253)
2024-04-22 10:16:29,435|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.ozone.client.OzoneClientFactory.getRpcClient(OzoneClientFactory.java:115)
2024-04-22 10:16:29,436|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.fs.ozone.BasicRootedOzoneClientAdapterImpl.<init>(BasicRootedOzoneClientAdapterImpl.java:201)
2024-04-22 10:16:29,436|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.fs.ozone.RootedOzoneClientAdapterImpl.<init>(RootedOzoneClientAdapterImpl.java:51)
2024-04-22 10:16:29,436|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.fs.ozone.RootedOzoneFileSystem.createAdapter(RootedOzoneFileSystem.java:111)
2024-04-22 10:16:29,436|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.fs.ozone.BasicRootedOzoneFileSystem.initialize(BasicRootedOzoneFileSystem.java:189)
2024-04-22 10:16:29,436|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3451)
2024-04-22 10:16:29,437|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:161)
2024-04-22 10:16:29,437|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3556)
2024-04-22 10:16:29,437|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3503)
2024-04-22 10:16:29,437|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:521)
2024-04-22 10:16:29,437|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:269)
2024-04-22 10:16:29,438|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:173)
2024-04-22 10:16:29,438|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at java.security.AccessController.doPrivileged(Native Method)
2024-04-22 10:16:29,438|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at javax.security.auth.Subject.doAs(Subject.java:422)
2024-04-22 10:16:29,438|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
2024-04-22 10:16:29,438|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
2024-04-22 10:16:29,438|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Tampered/Invalid token.
2024-04-22 10:16:29,439|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1616)
2024-04-22 10:16:29,439|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.ipc.Client.call(Client.java:1562)
2024-04-22 10:16:29,439|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.ipc.Client.call(Client.java:1459)
2024-04-22 10:16:29,439|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
2024-04-22 10:16:29,439|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
2024-04-22 10:16:29,440|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at com.sun.proxy.$Proxy17.submitRequest(Unknown Source)
2024-04-22 10:16:29,440|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
2024-04-22 10:16:29,440|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2024-04-22 10:16:29,440|INFO|Thread-37|machine.py:205 - run()||GUID=51e988c6-6805-43d1-9290-eb6f667ac2dd|at java.lang.reflect.Method.invoke(Method.java:498)
I checked the leader OM logs, which show the following:
2024-04-22 10:16:24,671 WARN [Socket Reader #1 for port 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 10.140.133.64:46032:null (DIGEST-MD5: IO error acquiring password) with true cause: (OM:om102 is not the leader. Could not determine the leader node.)
2024-04-22 10:16:24,671 WARN [Socket Reader #1 for port 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 10.140.68.1:43592:null (DIGEST-MD5: IO error acquiring password) with true cause: (OM:om102 is not the leader. Could not determine the leader node.)
2024-04-22 10:16:24,672 WARN [Socket Reader #1 for port 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 10.140.170.2:41974:null (DIGEST-MD5: IO error acquiring password) with true cause: (OM:om102 is not the leader. Could not determine the leader node.)
2024-04-22 10:16:24,672 WARN [Socket Reader #1 for port 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 10.140.133.64:46020:null (DIGEST-MD5: IO error acquiring password) with true cause: (OM:om102 is not the leader. Could not determine the leader node.)
2024-04-22 10:16:24,672 WARN [Socket Reader #1 for port 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 10.140.11.131:50274:null (DIGEST-MD5: IO error acquiring password) with true cause: (OM:om102 is not the leader. Could not determine the leader node.)
2024-04-22 10:16:24,675 WARN [Socket Reader #1 for port 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 10.140.11.131:50290:null (DIGEST-MD5: IO error acquiring password) with true cause: (OM:om102 is not the leader. Could not determine the leader node.)
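To line up these OM auth failures with the failed reducer attempts, it may help to summarize them by client host and by true cause. The following is only a rough sketch: the regular expression is inferred from the six log lines quoted above, and `AUTH_FAILED`, `summarize_auth_failures`, and `SAMPLE` are hypothetical names for illustration, not part of Ozone or Hadoop.

```python
import re
from collections import Counter

# Assumed format, inferred from the SecurityLogger lines quoted in this ticket:
#   Auth failed for <ip>:<port>:<user> (<mechanism>: <reason>) with true cause: (<cause>)
AUTH_FAILED = re.compile(
    r"Auth failed for (?P<host>\d+\.\d+\.\d+\.\d+):(?P<port>\d+):\S*"
    r" \((?P<mechanism>[^:]+): (?P<reason>[^)]*)\)"
    r" with true cause: \((?P<cause>[^)]*)\)"
)


def summarize_auth_failures(log_text):
    """Count auth failures per client host and per true cause.

    Works whether the entries are one per line or run together on one line,
    because it scans for every match instead of splitting on newlines.
    """
    by_host = Counter()
    by_cause = Counter()
    for match in AUTH_FAILED.finditer(log_text):
        by_host[match.group("host")] += 1
        by_cause[match.group("cause")] += 1
    return by_host, by_cause


# Two entries copied from the OM log above, used as sample input.
SAMPLE = (
    "2024-04-22 10:16:24,671 WARN [Socket Reader #1 for port 9862]-"
    "SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for "
    "10.140.133.64:46032:null (DIGEST-MD5: IO error acquiring password) "
    "with true cause: (OM:om102 is not the leader. Could not determine "
    "the leader node.) "
    "2024-04-22 10:16:24,672 WARN [Socket Reader #1 for port 9862]-"
    "SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for "
    "10.140.133.64:46020:null (DIGEST-MD5: IO error acquiring password) "
    "with true cause: (OM:om102 is not the leader. Could not determine "
    "the leader node.)"
)

if __name__ == "__main__":
    hosts, causes = summarize_auth_failures(SAMPLE)
    for host, count in hosts.most_common():
        print(host, count)
    for cause, count in causes.most_common():
        print(count, cause)
```

Running this over the full OM log for the restart window would show whether every rejected connection carries the same "not the leader" cause, and which client hosts were affected.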