HBase / HBASE-5249

Using a stale explicit row lock in a Put triggers an NPE

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.92.0
    • Fix Version/s: 0.92.3
    • Component/s: regionserver
    • Labels: None

      Description

      After acquiring an explicit row lock, if one attempts to send a Put after the row lock has expired, an NPE is triggered in the RegionServer instead of an UnknownRowLockException being thrown back to the client.

      2012-01-20 17:09:54,074 ERROR
      org.apache.hadoop.hbase.regionserver.HRegionServer: Error obtaining
      row lock (fsOk: true)
      java.lang.NullPointerException
             at java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:881)
             at org.apache.hadoop.hbase.regionserver.HRegionServer.addRowLock(HRegionServer.java:2313)
             at org.apache.hadoop.hbase.regionserver.HRegionServer.lockRow(HRegionServer.java:2299)
             at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
             at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
             at java.lang.reflect.Method.invoke(Method.java:597)
             at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
             at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1327)
      
      It happened only once out of thousands of RPCs that grabbed and
      released a row lock.
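
      For reference, a minimal client-side sketch of the sequence using the 0.92-era HTable API. The table, family, qualifier and sleep duration below are illustrative and not taken from the report; the lock lease is observed to be about a minute in the logs further down.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Put;
      import org.apache.hadoop.hbase.client.RowLock;
      import org.apache.hadoop.hbase.util.Bytes;

      public class StaleRowLockRepro {
        public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "t");      // hypothetical table "t" with family "f"
          byte[] row = Bytes.toBytes("r");

          RowLock lock = table.lockRow(row);         // explicit row lock, held under a server-side lease

          // Sleep past the row lock lease so the server drops the lock id
          // while the client still holds the RowLock object.
          Thread.sleep(70 * 1000L);

          Put put = new Put(row, lock);              // Put carrying the now-stale lock id
          put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
          table.put(put);                            // expected: UnknownRowLockException; observed: server-side NPE
        }
      }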
      


          Activity

          stack added a comment -

          Copying over Ted's comment that was in HBASE-5171 (resolved as a duplicate):

          Zhihong Yu added a comment - 10/Jan/12 18:03
          One workaround is to increase the value for "hbase.rowlock.wait.duration".
          But a new exception should be introduced anyway.
          

          Yves says up on the list:

          After checking the source code I've noticed that the value which is going to be put into the HashMap can be null in the case where the waitForLock flag is true or the rowLockWaitDuration is expired (HRegion#internalObtainRowLock, line 2111ff). The latter I think happens in our case as we have heavy load hitting the server.
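
          (For context, "hbase.rowlock.wait.duration" is a RegionServer-side setting for how long a write waits on an already-held row lock before obtainRowLock() gives up, 30000 ms by default. The workaround Ted mentions is to raise it in hbase-site.xml on the RegionServers; the value below is only illustrative.)

                <property>
                  <name>hbase.rowlock.wait.duration</name>
                  <!-- milliseconds to wait for a contended row lock; default is 30000 -->
                  <value>60000</value>
                </property>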
          
          Benoit Sigoure added a comment -

          This was a bug in asynchbase, as explained in the GitHub issue linked above. I'm lowering the priority to Minor. The bug in HBase is that the RegionServer couldn't determine that the Put that came with an explicit row lock was trying to use a lock id that didn't exist, so it was NPE'ing instead of throwing an UnknownRowLockException back to the client.

          This can happen with HTable too if one acquires an explicit row lock, waits for it to expire, and then attempts to use it in a Put.

          Benoit Sigoure added a comment -

          Related asynchbase issue: https://github.com/stumbleupon/asynchbase/issues/16

          I'm now wondering whether the problem is really in HBase. Could be a bug in asynchbase too. I'm going to investigate after lunch. Even if it's a bug in asynchbase, HBase shouldn't NPE like that. There's some error handling that's missing in the RegionServer.

          But if this turns out to be an asynchbase-only bug, we can lower the severity of this bug.

          Benoit Sigoure added a comment -

          I can actually consistently reproduce this bug on a fresh deployment of OpenTSDB with HBase 0.92.

          Benoit Sigoure added a comment -

          I just ran into this again, still on JD's 0.92 test cluster.

          From the logs of OpenTSDB:

          2012-01-24 18:32:17,605 ERROR [New I/O server worker #1-1] UniqueId: Put failed, attempts left=5 (retrying in 800 ms), put=PutRequest(table="tsdb-uid", key="\x00", family="id", qualifier="metrics
          ", value=[0, 0, 0, 0, 0, 0, 0, 1], lockid=-1, durable=true, bufferable=false, attempt=0, region=RegionInfo(table="tsdb-uid", region_name="tsdb-uid,,1327429528678.c421780d32aae9959a1b821a441fca86.", stop_key=""))
          org.hbase.async.RemoteException: java.io.IOException: java.lang.NullPointerException
          
                  at org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:1076)
                  at org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:1065)
                  at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1815)
                  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                  at java.lang.reflect.Method.invoke(Method.java:597)
                  at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
                  at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1348)
          Caused by: java.lang.NullPointerException
                  at java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:922)
                  at org.apache.hadoop.hbase.regionserver.HRegion.releaseRowLock(HRegion.java:2639)
                  at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1658)
                  at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1813)
                  ... 6 more
          

          From the logs of the RegionServer:

          2012-01-24 18:31:47,545 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Row lock -8669450923246717741 explicitly acquired by client
          2012-01-24 18:32:17,554 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: 
          java.lang.NullPointerException
          	at java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:922)
          	at org.apache.hadoop.hbase.regionserver.HRegion.releaseRowLock(HRegion.java:2639)
          	at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1658)
          	at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1813)
          	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          	at java.lang.reflect.Method.invoke(Method.java:597)
          	at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
          	at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1348)
          2012-01-24 18:32:47,549 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Row Lock -8669450923246717741 lease expired
          2012-01-24 18:32:47,576 WARN org.apache.hadoop.ipc.HBaseServer: (operationTooSlow): {"processingtimems":29166,"client":"10.4.13.49:47018","starttimems":1327429938406,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"put","totalColumns":1,"table":"tsdb-uid","families":{"id":[{"timestamp":-8669450923246717741,"qualifier":"metrics","vlen":8}]},"row":"\\x00"}
          
          Ted Yu added a comment -
                Integer r = region.obtainRowLock(row);
          

          I think r was null before the following was called:

                long lockId = addRowLock(r, region);
          

          In a busy region, this could be due to the following:

                      if (!existingLatch.await(this.rowLockWaitDuration,
                                      TimeUnit.MILLISECONDS)) {
                          return null;
          

          We should check r against null in lockRow().
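
          A rough sketch of that guard inside HRegionServer.lockRow(), assuming the method body roughly matches the snippets above; the exception type and message here are placeholders, not the committed fix:

                // Sketch only: guard the value returned by obtainRowLock() before it reaches
                // addRowLock(), whose ConcurrentHashMap.put() NPEs on a null value.
                Integer r = region.obtainRowLock(row);
                if (r == null) {
                  // obtainRowLock() returns null when rowLockWaitDuration elapses on a busy region;
                  // report that to the client instead of letting the NPE escape.
                  throw new IOException("Timed out waiting for row lock on row "
                      + Bytes.toStringBinary(row));
                }
                long lockId = addRowLock(r, region);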

          Ted Yu added a comment -

          This could be related to HBASE-5171.


            People

            • Assignee: stack
            • Reporter: stack
            • Votes: 0
            • Watchers: 4
