Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- Reviewed
Description
I ran into an issue on HBase 2.4.6 that I think is related to HBASE-26679. Individual writes block on a SyncFuture that never gets completed. Eventually (after 5 minutes) the writes time out and fail, but the regionserver stayed hung like this basically forever, until I killed it about 14 hours later. While HBASE-26679 may fix the hang itself, I think we should have additional protection against such zombie states. In this case I think a WAL roll was requested due to the failed appends, but the roll also hung forever. See the stack trace below:
Thread 240 (regionserver/host:60020.logRoller):
  State: WAITING
  Blocked count: 38
  Waited count: 293
  Waiting on java.util.concurrent.CompletableFuture$Signaller@13342c6d
  Stack:
    java.base@11.0.5/jdk.internal.misc.Unsafe.park(Native Method)
    java.base@11.0.5/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
    java.base@11.0.5/java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1796)
    java.base@11.0.5/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128)
    java.base@11.0.5/java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1823)
    java.base@11.0.5/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1998)
    app//org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:189)
    app//org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(AsyncProtobufLogWriter.java:202)
    app//org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(AbstractProtobufLogWriter.java:170)
    app//org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(AsyncFSWALProvider.java:113)
    app//org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:669)
    app//org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:130)
    app//org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:841)
    app//org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(AbstractWALRoller.java:268)
    app//org.apache.hadoop.hbase.wal.AbstractWALRoller.run(AbstractWALRoller.java:187)
The WAL roller thread was stuck on this wait seemingly forever, so it was never able to roll the WAL and get writes working again. I think we should add a timeout here, and abort the regionserver if a WAL cannot be rolled in a timely manner.
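To illustrate the kind of protection proposed here, below is a rough sketch of a bounded wait. It is not the actual HBase change. The class BoundedWalRoll, the method awaitOrAbort, and the abort hook are hypothetical names made up for this example; the only real API used is the JDK's CompletableFuture.get(long, TimeUnit). The idea is that the roller waits a limited time for the new writer's header write and, on timeout, triggers an abort instead of parking forever as in the trace above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

// Hypothetical helper for illustration only; not part of HBase.
final class BoundedWalRoll {

    /**
     * Waits for one step of a WAL roll, but only up to timeoutMs.
     * Returns true if the step finished in time. On timeout it calls
     * abortHook (standing in for a regionserver abort) and returns false.
     */
    public static boolean awaitOrAbort(CompletableFuture<?> step,
                                       long timeoutMs,
                                       Consumer<String> abortHook) {
        try {
            // Bounded wait, unlike the no-argument get() seen in the stack trace.
            step.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            abortHook.accept("WAL roll did not complete within " + timeoutMs + " ms");
            return false;
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for WAL roll", e);
        } catch (ExecutionException e) {
            // The step failed outright; that is a different path from a hang.
            throw new IllegalStateException("WAL roll step failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        // A future that is never completed stands in for the stuck header write.
        CompletableFuture<Long> stuck = new CompletableFuture<>();
        boolean rolled = awaitOrAbort(stuck, 100,
            reason -> System.out.println("ABORT: " + reason));
        System.out.println("rolled=" + rolled);
    }
}
```

In a real regionserver the timeout would presumably be configurable and the hook would call the server's abort path; both are assumptions here. A failed write is deliberately kept separate from a timeout, since only the timeout case corresponds to the zombie state described above.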
Issue Links
- is related to HBASE-26552 Introduce retry to logroller to avoid abort (Resolved)
- links to