Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Not A Problem
-
1.4.0
-
None
-
None
Description
After running:
hbase org.apache.hadoop.hbase.util.LoadTestTool -start_key 0 -num_keys 100000 -bloom ROWCOL -compression NONE -write 10:100:100 -read 100:20 -tn loadtest_d1
the command fails after being in execution state for nearly 2.5 Hours. There were many exceptions like below:
2023-03-15 20:11:12,660|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f| 2023-03-15 20:11:12,660|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:142) 2023-03-15 20:11:12,661|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|at org.apache.hadoop.hbase.client.HTable.put(HTable.java:514) 2023-03-15 20:11:12,661|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|at org.apache.hadoop.hbase.util.MultiThreadedWriter$HBaseWriterThread.insert(MultiThreadedWriter.java:144) 2023-03-15 20:11:12,661|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|at org.apache.hadoop.hbase.util.MultiThreadedWriter$HBaseWriterThread.run(MultiThreadedWriter.java:117) 2023-03-15 20:11:12,662|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call to address=ozpc4-cf4-2.ozpc4-cf4.root.hwx.site/172.27.27.133:22101 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call[id=162625,methodName=Mutate], waitTime=60011ms, rpcTimeout=60000ms 2023-03-15 20:11:12,662|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:223) 2023-03-15 20:11:12,662|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:384) 2023-03-15 20:11:12,662|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:88) 2023-03-15 20:11:12,663|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:415) 2023-03-15 20:11:12,663|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:411) 2023-03-15 20:11:12,663|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|at org.apache.hadoop.hbase.ipc.Call.setTimeout(Call.java:107) 2023-03-15 20:11:12,663|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|at org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:136) 2023-03-15 20:11:12,664|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715) 2023-03-15 20:11:12,664|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|at org.apache.hbase.thirdparty.io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34) 2023-03-15 20:11:12,664|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703) 2023-03-15 20:11:12,664|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790) 2023-03-15 20:11:12,665|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503) 2023-03-15 20:11:12,665|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|at java.lang.Thread.run(Thread.java:750) 2023-03-15 20:11:12,665|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call[id=162625,methodName=Mutate], waitTime=60011ms, rpcTimeout=60000ms 2023-03-15 20:11:12,665|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|at org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:137) 2023-03-15 20:11:12,666|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|... 6 more 2023-03-15 20:11:12,666|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|
But still the keys were getting written(very slowly tho, for HBase on HDFS, this command takes approx 3-4 minutes), towards the end the key write starts failing:
2023-03-15 22:30:21,612|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|23/03/15 22:30:21 INFO util.MultiThreadedAction: [W:101] Keys=99680, cols=1.1 M, time=02:37:21 Overall: [keys/s= 10, latency=4268.13 ms] Current: [keys/s=47, latency=1949.67 ms], wroteUpTo=99179, wroteQSize=400 2023-03-15 22:30:26,566|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|23/03/15 22:30:26 INFO util.MultiThreadedAction: [R:20] Keys=7254451, cols=86.6 M, time=02:37:26 Overall: [keys/s= 767, latency=20.86 ms] Current: [keys/s=964, latency=23.55 ms], verified=7254451, READ FAILURES=66 2023-03-15 22:30:26,608|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|23/03/15 22:30:26 INFO util.MultiThreadedAction: [W:82] Keys=100000, cols=1.1 M, time=02:37:26 Overall: [keys/s= 10, latency=4259.79 ms] Current: [keys/s=64, latency=1662.05 ms], wroteUpTo=99770, wroteQSize=148 2023-03-15 22:30:29,978|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|Failed to write keys: 151 2023-03-15 22:30:30,027|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|Failed to write key: 13207 2023-03-15 22:30:30,072|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|Failed to write key: 13223 2023-03-15 22:30:30,114|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|Failed to write key: 13240 2023-03-15 22:30:30,159|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|Failed to write key: 13243 2023-03-15 22:30:30,204|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|Failed to write key: 13255 2023-03-15 22:30:30,252|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|Failed to write key: 13256 2023-03-15 22:30:30,296|INFO|MainThread|machine.py:203 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|Failed to write key: 13260 2023-03-15 22:30:30,341|INFO|MainThread|machine.py:210 - run()||GUID=12e6a591-a42f-4d02-937a-bdc9d62acd1f|Failed to write key: 13271
Checked the Master log during that time, was getting org.apache.hadoop.fs.ozone.BasicRootedOzoneClientAdapterImpl:
2023-03-15 22:30:43,417 INFO org.apache.hadoop.hbase.master.SplitLogManager: ofs://ozone1/vol1/bucket1/hbase/WALs/ozpc4-cf4-6.ozpc4-cf4.root.hwx.site,22101,1678906419027-splitting dir is empty, no logs to split. 2023-03-15 22:30:43,417 INFO org.apache.hadoop.hbase.master.SplitWALManager: ozpc4-cf4-6.ozpc4-cf4.root.hwx.site,22101,1678906419027 WAL count=0, meta=false 2023-03-15 22:30:43,449 ERROR org.apache.hadoop.fs.ozone.BasicRootedOzoneClientAdapterImpl: delete key failed Directory is not empty. Key:hbase/WALs/ozpc4-cf4-6.ozpc4-cf4.root.hwx.site,22101,1678906419027-splitting 2023-03-15 22:30:43,449 WARN org.apache.hadoop.hbase.master.MasterWalManager: Failed archiving meta log for server ozpc4-cf4-6.ozpc4-cf4.root.hwx.site,22101,1678906419027 org.apache.hadoop.fs.PathIsNotEmptyDirectoryException: `Directory is not empty. Key:hbase/WALs/ozpc4-cf4-6.ozpc4-cf4.root.hwx.site,22101,1678906419027-splitting': Directory is not empty at org.apache.hadoop.fs.ozone.BasicRootedOzoneClientAdapterImpl.deleteObject(BasicRootedOzoneClientAdapterImpl.java:548) at org.apache.hadoop.fs.ozone.BasicRootedOzoneFileSystem.delete(BasicRootedOzoneFileSystem.java:646) at org.apache.hadoop.hbase.master.MasterWalManager.archiveMetaLog(MasterWalManager.java:395) at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.cleanupSplitDir(ServerCrashProcedure.java:280) at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:225) at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:65) at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188) at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:936) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:73) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1968) 2023-03-15 22:30:43,473 ERROR org.apache.hadoop.fs.ozone.BasicRootedOzoneClientAdapterImpl: delete key failed Directory is not empty. Key:hbase/WALs/ozpc4-cf4-6.ozpc4-cf4.root.hwx.site,22101,1678906419027-splitting 2023-03-15 22:30:43,476 ERROR org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception: pid=419, state=RUNNABLE:SERVER_CRASH_DELETE_SPLIT_WALS_DIR, locked=true; ServerCrashProcedure ozpc4-cf4-6.ozpc4-cf4.root.hwx.site,22101,1678906419027, splitWal=true, meta=false java.lang.NullPointerException at java.util.Arrays.stream(Arrays.java:5004) at org.apache.hadoop.hbase.master.SplitWALManager.deleteWALDir(SplitWALManager.java:132) at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.cleanupSplitDir(ServerCrashProcedure.java:282) at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:225) at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:65) at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188) at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:936) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:73) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1968) 2023-03-15 22:30:43,534 ERROR org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=419, state=FAILED:SERVER_CRASH_SPLIT_LOGS, locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=419, state=RUNNABLE:SERVER_CRASH_DELETE_SPLIT_WALS_DIR, locked=true; ServerCrashProcedure ozpc4-cf4-6.ozpc4-cf4.root.hwx.site,22101,1678906419027, splitWal=true, meta=false:java.lang.NullPointerException; ServerCrashProcedure ozpc4-cf4-6.ozpc4-cf4.root.hwx.site,22101,1678906419027, splitWal=true, meta=false java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_DELETE_SPLIT_WALS_DIR at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:378) at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:65) at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:212) at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:952) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1569) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1502) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1353) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:73) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1968) 2023-03-15 22:30:43,547 ERROR org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=419, state=FAILED:SERVER_CRASH_GET_REGIONS, locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=419, state=RUNNABLE:SERVER_CRASH_DELETE_SPLIT_WALS_DIR, locked=true; ServerCrashProcedure ozpc4-cf4-6.ozpc4-cf4.root.hwx.site,22101,1678906419027, splitWal=true, meta=false:java.lang.NullPointerException; ServerCrashProcedure ozpc4-cf4-6.ozpc4-cf4.root.hwx.site,22101,1678906419027, splitWal=true, meta=false java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_SPLIT_LOGS at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:378) at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:65) at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:212) at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:952) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1569) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1502) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1353) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:73) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1968) 2023-03-15 22:30:43,578 ERROR org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=419, state=FAILED:SERVER_CRASH_START, locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=419, state=RUNNABLE:SERVER_CRASH_DELETE_SPLIT_WALS_DIR, locked=true; ServerCrashProcedure ozpc4-cf4-6.ozpc4-cf4.root.hwx.site,22101,1678906419027, splitWal=true, meta=false:java.lang.NullPointerException; ServerCrashProcedure ozpc4-cf4-6.ozpc4-cf4.root.hwx.site,22101,1678906419027, splitWal=true, meta=false java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_GET_REGIONS at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:378) at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:65) at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:212) at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:952) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1569) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1502) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1353) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:73) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1968) 2023-03-15 22:30:43,592 ERROR org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=419, state=FAILED:SERVER_CRASH_START, locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=419, state=RUNNABLE:SERVER_CRASH_DELETE_SPLIT_WALS_DIR, locked=true; ServerCrashProcedure ozpc4-cf4-6.ozpc4-cf4.root.hwx.site,22101,1678906419027, splitWal=true, meta=false:java.lang.NullPointerException; ServerCrashProcedure ozpc4-cf4-6.ozpc4-cf4.root.hwx.site,22101,1678906419027, splitWal=true, meta=false java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_START at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:378) at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:65) at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:212) at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:952) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1569) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1502) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1353) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:73) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1968) 2023-03-15 22:30:43,607 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Rolled back pid=419, state=ROLLEDBACK, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=419, state=RUNNABLE:SERVER_CRASH_DELETE_SPLIT_WALS_DIR, locked=true; ServerCrashProcedure ozpc4-cf4-6.ozpc4-cf4.root.hwx.site,22101,1678906419027, splitWal=true, meta=false:java.lang.NullPointerException; ServerCrashProcedure ozpc4-cf4-6.ozpc4-cf4.root.hwx.site,22101,1678906419027, splitWal=true, meta=false exec-time=297 msec 2023-03-15 22:31:02,422 INFO org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [ozpc4-cf4-4.ozpc4-cf4.root.hwx.site,22101,1678906419528] 2023-03-15 22:31:02,424 INFO org.apache.hadoop.hbase.master.ServerManager: Processing expiration of ozpc4-cf4-4.ozpc4-cf4.root.hwx.site,22101,1678906419528 on ozpc4-cf4-1.ozpc4-cf4.root.hwx.site,22001,1678906420490