HBASE-11813: CellScanner#advance may overflow stack

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.99.0, 0.98.6
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      On user@hbase, johannes.schaback@visual-meta.com reported:

      we face a serious issue with our HBase production cluster for two days now. Every couple minutes, a random RegionServer gets stuck and does not process any requests. In addition this causes the other RegionServers to freeze within a minute which brings down the entire cluster. Stopping the affected RegionServer unblocks the cluster and everything comes back to normal.

      Subsequent troubleshooting reveals that RPC is getting stuck because we are losing RPC handlers. In the .out files we have this:

      Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
      java.lang.StackOverflowError
              at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
              at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
              at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
              at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
      [...]
      Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=18,queue=0,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=23,queue=2,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=24,queue=0,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=2,queue=2,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=11,queue=2,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=25,queue=1,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=20,queue=2,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=19,queue=1,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=15,queue=0,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=1,queue=1,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=7,queue=1,port=60020"
      java.lang.StackOverflowError
      Exception in thread "defaultRpcServer.handler=4,queue=1,port=60020"
      java.lang.StackOverflowError
      

      That is the anonymous CellScanner instance we create from CellUtil#createCellScanner:

          return new CellScanner() {
            private final Iterator<? extends CellScannable> iterator = cellScannerables.iterator();
            private CellScanner cellScanner = null;
      
            @Override
            public Cell current() {
              return this.cellScanner != null? this.cellScanner.current(): null;
            }
      
            @Override
            public boolean advance() throws IOException {
              if (this.cellScanner == null) {
                if (!this.iterator.hasNext()) return false;
                this.cellScanner = this.iterator.next().cellScanner();
              }
              if (this.cellScanner.advance()) return true;
              this.cellScanner = null;
              return advance();  // <--- the problematic recursive call
            }
          };
      

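      That final return statement is the immediate problem: each time the current scanner is exhausted, advance() calls itself, so a long run of consecutive CellScannables that yield no cells adds one stack frame per scannable until the handler thread overflows its stack.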
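
One way to remove the recursion is to turn the tail call into a loop. The following is a minimal sketch under stated assumptions, not necessarily what the attached patches do: the nested `Cell`, `CellScanner`, and `CellScannable` interfaces are simplified stand-ins for the HBase types so that the example compiles on its own, and `EMPTY`, `singleCell`, and `countCells` are hypothetical helpers added only to exercise it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

class IterativeCellScannerSketch {
  // Simplified stand-ins for the HBase interfaces (assumption: reduced to
  // the methods this sketch needs).
  interface Cell {}
  interface CellScanner {
    Cell current();
    boolean advance() throws IOException;
  }
  interface CellScannable {
    CellScanner cellScanner();
  }

  static CellScanner createCellScanner(final Iterable<? extends CellScannable> cellScannerables) {
    return new CellScanner() {
      private final Iterator<? extends CellScannable> iterator = cellScannerables.iterator();
      private CellScanner cellScanner = null;

      @Override
      public Cell current() {
        return this.cellScanner != null ? this.cellScanner.current() : null;
      }

      @Override
      public boolean advance() throws IOException {
        // Loop instead of recursing: stack depth stays constant no matter
        // how many exhausted or empty scanners are skipped.
        while (true) {
          if (this.cellScanner == null) {
            if (!this.iterator.hasNext()) return false;
            this.cellScanner = this.iterator.next().cellScanner();
          }
          if (this.cellScanner.advance()) return true;
          this.cellScanner = null;
        }
      }
    };
  }

  // Hypothetical helper: a scannable whose scanner never yields a cell.
  static final CellScannable EMPTY = () -> new CellScanner() {
    @Override public Cell current() { return null; }
    @Override public boolean advance() { return false; }
  };

  // Hypothetical helper: a scannable whose scanner yields exactly one cell.
  static CellScannable singleCell(final Cell cell) {
    return () -> new CellScanner() {
      private int state = 0; // 0 = before the cell, 1 = on the cell, 2 = exhausted
      @Override public Cell current() { return state == 1 ? cell : null; }
      @Override public boolean advance() {
        if (state < 2) state++;
        return state == 1;
      }
    };
  }

  // Scans a run of empty scannables followed by one cell; returns the cell count.
  static int countCells(int emptyScannables) {
    List<CellScannable> scannables =
        new ArrayList<>(Collections.nCopies(emptyScannables, EMPTY));
    scannables.add(singleCell(new Cell() {}));
    CellScanner scanner = createCellScanner(scannables);
    int count = 0;
    try {
      while (scanner.advance()) {
        if (scanner.current() != null) count++;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println("cells seen: " + countCells(1_000_000));
  }
}
```

The loop keeps the original control flow (fetch the next scanner, try to advance, drop it when exhausted) but re-enters at the top of the method body rather than through a new call, so a million empty scannables cost no extra stack.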

      We should also fix this so the RegionServer aborts if it loses a handler to an Error.
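
As a rough illustration of the second point, a handler thread can report its own death through the JDK's `Thread.UncaughtExceptionHandler`. This is a sketch only, and it makes no claim about how the RPC server is actually wired: the `Abortable` interface here is a simplified stand-in, and `newHandler`, `recurseForever`, and `demo` are hypothetical names. Note that it reacts to any uncaught Throwable, not just to an Error.

```java
import java.util.concurrent.atomic.AtomicReference;

class HandlerAbortSketch {
  // Simplified stand-in (assumption) for whatever hook shuts the server down.
  interface Abortable {
    void abort(String why, Throwable cause);
  }

  // Creates a handler thread that asks the server to abort if the thread
  // dies from anything uncaught, including a StackOverflowError.
  static Thread newHandler(String name, Runnable work, Abortable server) {
    Thread t = new Thread(work, name);
    t.setUncaughtExceptionHandler((thread, e) ->
        server.abort("Lost handler " + thread.getName(), e));
    return t;
  }

  // Deliberately unbounded recursion, to provoke a StackOverflowError.
  static void recurseForever() {
    recurseForever();
  }

  // Runs one doomed handler and returns the Throwable handed to abort().
  static Throwable demo() {
    AtomicReference<Throwable> seen = new AtomicReference<>();
    Abortable server = (why, cause) -> seen.set(cause);
    Thread handler = newHandler("handler-0", HandlerAbortSketch::recurseForever, server);
    handler.start();
    try {
      handler.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return seen.get();
  }

  public static void main(String[] args) {
    System.out.println("abort cause: " + demo());
  }
}
```

The uncaught-exception handler runs on the dying thread after its stack has unwound, so it has room to run even when the cause was a stack overflow. Without such a hook the thread simply disappears, which matches the silent loss of handlers described above.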

        Attachments

        1. catch_all_exceptions.txt
          0.8 kB
          Michael Stack
        2. 11813v3.master.txt
          12 kB
          Michael Stack
        3. 11813v2.master.txt
          12 kB
          Michael Stack
        4. 11813.master.txt
          12 kB
          Michael Stack
        5. 11813.master.txt
          9 kB
          Michael Stack
        6. 11813.098.txt
          11 kB
          Michael Stack
        7. 11813.098.txt
          10 kB
          Michael Stack

              People

              • Assignee: Michael Stack (stack)
              • Reporter: Andrew Kyle Purtell (apurtell)