HBase / HBASE-11813

CellScanner#advance may overflow stack

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.99.0, 0.98.6
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      On user@hbase, johannes.schaback@visual-meta.com reported:

      we face a serious issue with our HBase production cluster for two days now. Every couple minutes, a random RegionServer gets stuck and does not process any requests. In addition this causes the other RegionServers to freeze within a minute which brings down the entire cluster. Stopping the affected RegionServer unblocks the cluster and everything comes back to normal.

      Subsequent troubleshooting reveals that RPC is getting stuck because we are losing RPC handlers. In the .out files we have this:

      Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
      java.lang.StackOverflowError
              at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
              at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
              at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
              at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
      [...]
      Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=18,queue=0,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=23,queue=2,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=24,queue=0,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=2,queue=2,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=11,queue=2,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=25,queue=1,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=20,queue=2,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=19,queue=1,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=15,queue=0,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=1,queue=1,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=7,queue=1,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=4,queue=1,port=60020"
      java.lang.StackOverflowError

      That is the anonymous CellScanner instance we create from CellUtil#createCellScanner:

          return new CellScanner() {
            private final Iterator<? extends CellScannable> iterator = cellScannerables.iterator();
            private CellScanner cellScanner = null;

            @Override
            public Cell current() {
              return this.cellScanner != null ? this.cellScanner.current() : null;
            }

            @Override
            public boolean advance() throws IOException {
              if (this.cellScanner == null) {
                if (!this.iterator.hasNext()) return false;
                this.cellScanner = this.iterator.next().cellScanner();
              }
              if (this.cellScanner.advance()) return true;
              this.cellScanner = null;
              return advance();  // <--- recursive call
            }
          };
      

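      That final return statement is the immediate problem. advance() calls itself once for every CellScannable whose scanner is empty or already exhausted, and Java does not eliminate tail calls, so a long enough run of such CellScannables adds one stack frame per item until the handler thread overflows its stack.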
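
One way to remove the recursion is to turn the tail call into a loop. The following is a minimal sketch under stated assumptions, not necessarily the committed patch: `CellScanner` and `CellScannable` here are simplified stand-ins for the HBase interfaces (a cell is just a `String`), and `CellScannerSketch` and `scannableOf` are hypothetical helpers that exist only for this illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Simplified stand-ins for the HBase types (assumption: a cell is a String).
interface CellScanner {
  String current();
  boolean advance() throws IOException;
}

interface CellScannable {
  CellScanner cellScanner();
}

final class CellScannerSketch {
  // Hypothetical helper: a CellScannable backed by a fixed list of cells.
  static CellScannable scannableOf(String... cells) {
    final List<String> copy = Arrays.asList(cells);
    return () -> new CellScanner() {
      private final Iterator<String> it = copy.iterator();
      private String current = null;

      @Override
      public String current() {
        return current;
      }

      @Override
      public boolean advance() {
        if (!it.hasNext()) {
          current = null;
          return false;
        }
        current = it.next();
        return true;
      }
    };
  }

  // Same logic as the quoted advance(), with the tail call replaced by a loop,
  // so stack depth stays constant however many empty scannables occur in a row.
  static CellScanner createCellScanner(final Iterable<? extends CellScannable> cellScannerables) {
    return new CellScanner() {
      private final Iterator<? extends CellScannable> iterator = cellScannerables.iterator();
      private CellScanner cellScanner = null;

      @Override
      public String current() {
        return this.cellScanner != null ? this.cellScanner.current() : null;
      }

      @Override
      public boolean advance() throws IOException {
        while (true) {
          if (this.cellScanner == null) {
            if (!this.iterator.hasNext()) return false;
            this.cellScanner = this.iterator.next().cellScanner();
          }
          if (this.cellScanner.advance()) return true;
          this.cellScanner = null; // exhausted: loop around to the next scannable
        }
      }
    };
  }

  public static void main(String[] args) throws IOException {
    // A long run of empty scannables followed by one that holds two cells.
    List<CellScannable> input =
        new ArrayList<>(Collections.nCopies(1_000_000, scannableOf()));
    input.add(scannableOf("a", "b"));

    CellScanner scanner = createCellScanner(input);
    int count = 0;
    while (scanner.advance()) count++;
    System.out.println(count); // prints 2
  }
}
```

With the recursive version, the same input would need roughly one stack frame per empty scannable before reaching the first cell.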

      We should also fix this so the RegionServer aborts if it loses a handler to an Error.
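
The second point could look roughly like the sketch below: a handler that catches any `Error` escaping its task and reports it to an abort hook instead of dying silently. This illustrates the idea only and is not the RegionServer's actual handler code. The `Abortable` interface here is a stand-in defined for the example, and `HandlerSketch` and `recurseForever` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicReference;

// Stand-in for the server's abort hook (assumption, defined for this sketch).
interface Abortable {
  void abort(String why, Throwable cause);
}

final class HandlerSketch implements Runnable {
  private final Runnable task;
  private final Abortable server;

  HandlerSketch(Runnable task, Abortable server) {
    this.task = task;
    this.server = server;
  }

  @Override
  public void run() {
    try {
      task.run();
    } catch (Error e) {
      // By the time the catch block runs the stack has unwound, so it is
      // safe to call out. Losing a handler silently shrinks the pool, so
      // escalate to the server instead.
      server.abort("Handler " + Thread.currentThread().getName() + " died", e);
    }
  }

  // Deliberately unbounded recursion to provoke a StackOverflowError.
  static int recurseForever(int n) {
    return recurseForever(n + 1) + 1;
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicReference<Throwable> seen = new AtomicReference<>();
    Abortable server = (why, cause) -> seen.set(cause);

    Thread t = new Thread(new HandlerSketch(() -> recurseForever(0), server), "handler-0");
    t.start();
    t.join();

    System.out.println(seen.get() instanceof StackOverflowError); // prints true
  }
}
```

An uncaught-exception handler on the handler threads would be another way to get the same effect; the try/catch form is shown because it keeps the decision next to the work loop.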

      Attachments

        1. 11813.098.txt
          10 kB
          Michael Stack
        2. 11813.098.txt
          11 kB
          Michael Stack
        3. 11813.master.txt
          9 kB
          Michael Stack
        4. 11813.master.txt
          12 kB
          Michael Stack
        5. 11813v2.master.txt
          12 kB
          Michael Stack
        6. 11813v3.master.txt
          12 kB
          Michael Stack
        7. catch_all_exceptions.txt
          0.8 kB
          Michael Stack

            People

              Assignee: Michael Stack (stack)
              Reporter: Andrew Kyle Purtell (apurtell)
              Votes: 1
              Watchers: 12