HBASE-26715: Blocked on SyncFuture in AsyncProtobufLogWriter#write

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.5.0, 3.0.0-alpha-3, 2.4.11
    • None
    • Hadoop Flags: Reviewed

    Description

      I ran into an issue on HBase 2.4.6 that I think is related to HBASE-26679. Individual writes block on a SyncFuture that never gets completed. Eventually (after 5 minutes) the writes time out and fail, but the regionserver stayed hung like this basically forever, until I killed it about 14 hours later. While HBASE-26679 may fix the hang itself, I think we should have additional protection against such zombie states. In this case I think a WAL roll was requested due to the failed appends, but the roll also hung forever. See the stack trace below:

       

      Thread 240 (regionserver/host:60020.logRoller):
        State: WAITING
        Blocked count: 38
        Waited count: 293
        Waiting on java.util.concurrent.CompletableFuture$Signaller@13342c6d
        Stack:
          java.base@11.0.5/jdk.internal.misc.Unsafe.park(Native Method)
          java.base@11.0.5/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
          java.base@11.0.5/java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1796)
          java.base@11.0.5/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128)
          java.base@11.0.5/java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1823)
          java.base@11.0.5/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1998)
          app//org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:189)
          app//org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(AsyncProtobufLogWriter.java:202)
          app//org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(AbstractProtobufLogWriter.java:170)
          app//org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(AsyncFSWALProvider.java:113)
          app//org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:669)
          app//org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:130)
          app//org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:841)
          app//org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(AbstractWALRoller.java:268)
          app//org.apache.hadoop.hbase.wal.AbstractWALRoller.run(AbstractWALRoller.java:187) 

       

      The WAL roller thread was stuck on this wait seemingly forever, so it was never able to roll the WAL and get writes working again. I think we should add a timeout here, and abort the regionserver if a WAL cannot be rolled in a timely manner.
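
To illustrate the proposal, here is a minimal sketch of a bounded wait on a CompletableFuture, under stated assumptions. It is not the patch that resolved this issue. The class BoundedWait, the method getOrAbort, and the abortHook callback are hypothetical names used only for illustration, and the hook merely stands in for whatever mechanism would abort the regionserver. The only library call relied on is CompletableFuture.get(long, TimeUnit), which throws TimeoutException instead of parking indefinitely.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

/** Hypothetical helper, not part of HBase: waits on a future with a deadline. */
class BoundedWait {

  /**
   * Waits up to timeoutMs for the future to complete. On timeout, invokes
   * abortHook (a stand-in for aborting the regionserver) and rethrows, so the
   * caller fails fast instead of waiting forever.
   */
  static <T> T getOrAbort(CompletableFuture<T> future, long timeoutMs,
      Consumer<String> abortHook)
      throws InterruptedException, ExecutionException, TimeoutException {
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      abortHook.accept("write did not complete within " + timeoutMs + " ms");
      throw e;
    }
  }

  public static void main(String[] args) throws Exception {
    // A future that nobody ever completes, like the one the roller waited on.
    CompletableFuture<Long> stuck = new CompletableFuture<>();
    try {
      getOrAbort(stuck, 100, reason -> System.out.println("abort requested: " + reason));
    } catch (TimeoutException e) {
      System.out.println("roll failed instead of hanging");
    }
  }
}
```

With an unbounded get(), as in the stack trace above, the calling thread stays parked until someone completes the future. The bounded variant turns that into an explicit failure the caller can act on. What timeout to use and whether an abort is the right reaction are policy decisions this sketch does not settle.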

            People

              Assignee: Andrew Kyle Purtell (apurtell)
              Reporter: Bryan Beaudreault (bbeaudreault)