Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
Description
Note: Client machine where YCSB is triggered is in UTC tz, and Cluster is in PDT
YCSB Insert fails after ~13 hours.
2024-03-18 17:31:23,529|INFO|Thread-92|machine.py:205 - run()||GUID=a68008c1-78db-40f6-83f2-8c73a148a18f|2024-03-18 10:31:23:519 45480 sec: 840736 operations; 6.7 current ops/sec; est completion in 24 days 12 hours [INSERT: Count=67, Max=18284543, Min=21808, Avg=1205365.73, 90=2609151, 99=5935103, 99.9=18284543, 99.99=18284543] 2024-03-18 17:31:28,966|INFO|Thread-92|machine.py:205 - run()||GUID=a68008c1-78db-40f6-83f2-8c73a148a18f|Error inserting, not retrying any more. number of attempts: 1Insertion Retry Limit: 0 2024-03-18 17:31:33,537|INFO|Thread-92|machine.py:205 - run()||GUID=a68008c1-78db-40f6-83f2-8c73a148a18f|2024-03-18 10:31:33:519 45490 sec: 840788 operations; 5.2 current ops/sec; est completion in 24 days 12 hours [CLEANUP: Count=2, Max=139, Min=10, Avg=74.5, 90=139, 99=139, 99.9=139, 99.99=139] [INSERT: Count=52, Max=5058559, Min=20768, Avg=1127204.46, 90=2635775, 99=4816895, 99.9=5058559, 99.99=5058559] [INSERT-FAILED: Count=1, Max=38305791, Min=38273024, Avg=38289408, 90=38305791, 99=38305791, 99.9=38305791, 99.99=38305791]
Checked the logs of Regionservers, almost all shows similar logs during the failure time:
2024-03-18 10:31:10,199 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed table_ycsb1710695116,user8499,1710695130852.aa5c9ba2da2404b2eadae58e41e507e7. 2024-03-18 10:31:10,200 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed table_ycsb1710695116,user8399,1710695130852.bb1a196fb5cd18429103e4fcde2e1e40. at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:219) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:385) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:88) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:416) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:412) at org.apache.hadoop.hbase.ipc.Call.setTimeout(Call.java:107) at org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:136) at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715) at org.apache.hbase.thirdparty.io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34) at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703) at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790) at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503) 2024-03-18 10:31:15,202 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: HBase metrics system stopped. 2024-03-18 10:31:15,703 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties 2024-03-18 10:31:15,706 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s). 2024-03-18 10:31:15,706 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: HBase metrics system started 2024-03-18 10:31:19,676 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Failed report transition server { host_name: "vc0140.example.com" port: 22101 start_code: 1710689975823 } transition { transition_code: OPENED region_info { region_id: 1710737533661 table_name { namespace: "default" qualifier: "table_ycsb1710737518" } start_key: "user4399" end_key: "user4499" offline: false split: false replica_id: 0 } open_seq_num: 9915 proc_id: 5757 }; retry (#0) immediately. org.apache.hadoop.hbase.ipc.CallTimeoutException: Call to address=vc0132.example.com/10.17.207.42:22001 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call[id=31282,methodName=ReportRegionStateTransition], waitTime=10002ms, rpcTimeout=10000ms at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:219) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:385) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:88) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:416) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:412) at org.apache.hadoop.hbase.ipc.Call.setTimeout(Call.java:107) at org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:136) at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715) at org.apache.hbase.thirdparty.io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34) at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703) at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790) at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call[id=31282,methodName=ReportRegionStateTransition], waitTime=10002ms, rpcTimeout=10000ms at org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:137) ... 6 more 2024-03-18 10:31:19,867 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Slow sync cost: 2528 ms, current pipeline: [] 2024-03-18 10:31:20,207 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Failed report transition server { host_name: "vc0140.example.com" port: 22101 start_code: 1710689975823 } transition { transition_code: CLOSED region_info { region_id: 1710695130852 table_name { namespace: "default" qualifier: "table_ycsb1710695116" } start_key: "user8399" end_key: "user8499" offline: false split: false replica_id: 0 } proc_id: 5750 }; retry (#0) immediately. org.apache.hadoop.hbase.ipc.CallTimeoutException: Call to address=vc0132.example.com/10.17.207.42:22001 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call[id=31283,methodName=ReportRegionStateTransition], waitTime=10006ms, rpcTimeout=10000ms at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:219) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:385) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:88) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:416) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:412) at org.apache.hadoop.hbase.ipc.Call.setTimeout(Call.java:107) at org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:136) at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715) at org.apache.hbase.thirdparty.io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34) at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703) at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790) at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call[id=31283,methodName=ReportRegionStateTransition], waitTime=10006ms, rpcTimeout=10000ms at org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:137) ... 6 more
Checked the OM Leader log around this time, can see the below error logs:
2024-03-18 10:40:28,166 ERROR [om163-OMStateMachineApplyTransactionThread - 0]-org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO: Key delete failed. Volume:volhbase1, Bucket:buckethbase1, Key:hbase/archive/data/default/table_ycsb1710695116/daa0f8300994990f671fc60c006e7bfd. DIRECTORY_NOT_EMPTY org.apache.hadoop.ozone.om.exceptions.OMException: Directory is not empty. Key:hbase/archive/data/default/table_ycsb1710695116/daa0f8300994990f671fc60c006e7bfd at org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO.validateAndUpdateCache(OMKeyDeleteRequestWithFSO.java:137) at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.lambda$0(OzoneManagerRequestHandler.java:406) at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45) at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequestImpl(OzoneManagerRequestHandler.java:404) at org.apache.hadoop.ozone.protocolPB.RequestHandler.handleWriteRequest(RequestHandler.java:63) at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:525) at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:343) at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2024-03-18 10:40:33,261 ERROR [om163-OMStateMachineApplyTransactionThread - 0]-org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO: Key delete failed. Volume:volhbase1, Bucket:buckethbase1, Key:hbase/archive/data/default/table_ycsb1710695116. DIRECTORY_NOT_EMPTY org.apache.hadoop.ozone.om.exceptions.OMException: Directory is not empty. Key:hbase/archive/data/default/table_ycsb1710695116 at org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO.validateAndUpdateCache(OMKeyDeleteRequestWithFSO.java:137) at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.lambda$0(OzoneManagerRequestHandler.java:406) at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45) at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequestImpl(OzoneManagerRequestHandler.java:404) at org.apache.hadoop.ozone.protocolPB.RequestHandler.handleWriteRequest(RequestHandler.java:63) at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:525) at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:343) at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2024-03-18 10:40:36,149 INFO [Socket Reader #1 for port 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hue/vc0131.example.com@EXAMPLE.COM (auth:KERBEROS)
cc: weichiu