Description
On user@hbase, johannes.schaback@visual-meta.com reported:
we face a serious issue with our HBase production cluster for two days now. Every couple minutes, a random RegionServer gets stuck and does not process any requests. In addition this causes the other RegionServers to freeze within a minute which brings down the entire cluster. Stopping the affected RegionServer unblocks the cluster and everything comes back to normal.
Subsequent troubleshooting reveals that RPC is getting stuck because we are losing RPC handlers. In the .out files we have this:
Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020" java.lang.StackOverflowError
    at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
    at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
    at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
    at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
    [...]
Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020" java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=18,queue=0,port=60020" java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=23,queue=2,port=60020" java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=24,queue=0,port=60020" java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=2,queue=2,port=60020" java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=11,queue=2,port=60020" java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=25,queue=1,port=60020" java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=20,queue=2,port=60020" java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=19,queue=1,port=60020" java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=15,queue=0,port=60020" java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=1,queue=1,port=60020" java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=7,queue=1,port=60020" java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=4,queue=1,port=60020" java.lang.StackOverflowError
That is the anonymous CellScanner instance we create from CellUtil#createCellScanner:
    return new CellScanner() {
      private final Iterator<? extends CellScannable> iterator = cellScannerables.iterator();
      private CellScanner cellScanner = null;

      @Override
      public Cell current() {
        return this.cellScanner != null ? this.cellScanner.current() : null;
      }

      @Override
      public boolean advance() throws IOException {
        if (this.cellScanner == null) {
          if (!this.iterator.hasNext()) return false;
          this.cellScanner = this.iterator.next().cellScanner();
        }
        if (this.cellScanner.advance()) return true;
        this.cellScanner = null;
--->    return advance();
      }
    };
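That final return statement is the immediate problem: advance() calls itself each time the current CellScanner is exhausted, so a long run of empty CellScannables adds one stack frame per element until the handler thread dies with a StackOverflowError.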
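One way to remove the recursion is to turn the tail call into a loop, so the stack depth stays constant no matter how many scanners are exhausted in a row. The following is only a sketch of that idea, not the committed patch: Scanner and Scannable are hypothetical stand-ins for CellScanner and CellScannable (with String in place of Cell) so that the example compiles without HBase on the classpath.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class IterativeScannerSketch {
    /** Hypothetical stand-in for CellScanner, using String as the cell type. */
    interface Scanner {
        String current();
        boolean advance() throws IOException;
    }

    /** Hypothetical stand-in for CellScannable. */
    interface Scannable {
        Scanner scanner();
    }

    /** Helper for the demo: exposes a list of strings as a Scannable. */
    static Scannable of(final List<String> cells) {
        return () -> {
            final Iterator<String> it = cells.iterator();
            return new Scanner() {
                private String current = null;
                @Override public String current() { return current; }
                @Override public boolean advance() {
                    if (!it.hasNext()) { current = null; return false; }
                    current = it.next();
                    return true;
                }
            };
        };
    }

    /** Same contract as the scanner in the report, but advance() loops instead of recursing. */
    static Scanner concat(final Iterable<? extends Scannable> scannables) {
        return new Scanner() {
            private final Iterator<? extends Scannable> iterator = scannables.iterator();
            private Scanner scanner = null;

            @Override
            public String current() {
                return scanner != null ? scanner.current() : null;
            }

            @Override
            public boolean advance() throws IOException {
                while (true) {
                    if (scanner == null) {
                        if (!iterator.hasNext()) return false;
                        scanner = iterator.next().scanner();
                    }
                    if (scanner.advance()) return true;
                    // Current scanner is exhausted; move on to the next one
                    // on the next loop iteration rather than via a nested call.
                    scanner = null;
                }
            }
        };
    }

    public static void main(String[] args) throws IOException {
        // A long run of empty scannables is the input that deepens the recursive version.
        List<Scannable> input = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            input.add(of(Collections.<String>emptyList()));
        }
        input.add(of(Collections.singletonList("row-1")));

        Scanner s = concat(input);
        System.out.println(s.advance() + " " + s.current()); // prints "true row-1"
    }
}
```

The loop keeps the original control flow line for line; only the trailing `return advance();` is replaced by another pass through `while (true)`.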
We should also fix this so the RegionServer aborts if it loses a handler to an Error.
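As an illustration of the abort-on-lost-handler idea, a handler thread can be given an uncaught-exception handler that notifies the server when the thread dies. This is a sketch under assumptions, not how HBase implements it: the Abortable interface and newHandler method below are hypothetical names, and the sketch aborts on any uncaught Throwable rather than only on Error.

```java
import java.util.concurrent.atomic.AtomicReference;

public class HandlerAbortSketch {
    /** Hypothetical abort hook; a real server would begin shutting down here. */
    interface Abortable {
        void abort(String why, Throwable cause);
    }

    /** Creates a handler thread that reports its own death to the server. */
    static Thread newHandler(String name, Runnable work, final Abortable server) {
        Thread t = new Thread(work, name);
        // Runs in the dying thread when 'work' lets a Throwable escape,
        // including Errors such as StackOverflowError.
        t.setUncaughtExceptionHandler((thread, e) ->
            server.abort("handler " + thread.getName() + " died", e));
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        final AtomicReference<String> abortReason = new AtomicReference<>();
        Abortable server = (why, cause) -> abortReason.set(why + ": " + cause);

        // Simulate a handler lost to an Error.
        Thread handler = newHandler("demo-handler",
            () -> { throw new StackOverflowError("simulated"); }, server);
        handler.start();
        handler.join();

        System.out.println("aborted: " + abortReason.get());
    }
}
```

Without a hook like this, the dead thread only leaves a line in the .out file and the server keeps running with one fewer handler.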
Attachments
Issue Links
- relates to HBASE-12028 Abort the RegionServer, when it's handler threads die (Closed)