Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-16604

Scanner retries on IOException can cause the scans to miss data

VotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Reviewed

    Description

      Debugging an ITBLL failure, where the Verify did not "see" all the data in the cluster, I've noticed that if we end up getting a generic IOException from the HFileReader level, we may end up missing the rest of the data in the region. I was able to manually test this, and this stack trace helps to understand what is going on:

      2016-09-09 16:27:15,633 INFO  [hconnection-0x71ad3d8a-shared--pool21-t9] client.ScannerCallable(376): Open scanner=1 for scan={"loadColumnFamiliesOnDemand":null,"startRow":"","stopRow":"","batch":-1,"cacheBlocks":true,"totalColumns":1,"maxResultSize":2097152,"families":{"testFamily":["testFamily"]},"caching":100,"maxVersions":1,"timeRange":[0,9223372036854775807]} on region region=testScanThrowsException,,1473463632707.b2adfb618e5d0fe225c1dc40c0eabfee., hostname=hw10676,51833,1473463626529, seqNum=2
      2016-09-09 16:27:15,634 INFO  [B.fifo.QRpcServer.handler=5,queue=0,port=51833] regionserver.RSRpcServices(2196): scan request:scanner_id: 1 number_of_rows: 100 close_scanner: false next_call_seq: 0 client_handles_partials: true client_handles_heartbeats: true renew: false
      2016-09-09 16:27:15,635 INFO  [B.fifo.QRpcServer.handler=5,queue=0,port=51833] regionserver.RSRpcServices(2510): Rolling back next call seqId
      2016-09-09 16:27:15,635 INFO  [B.fifo.QRpcServer.handler=5,queue=0,port=51833] regionserver.RSRpcServices(2565): Throwing new ServiceExceptionjava.io.IOException: Could not reseek StoreFileScanner[HFileScanner for reader reader=hdfs://localhost:51795/user/enis/test-data/d6fb1c70-93c1-4099-acb7-5723fc05a737/data/default/testScanThrowsException/b2adfb618e5d0fe225c1dc40c0eabfee/testFamily/5a213cc23b714e5e8e1a140ebbe72f2c, compression=none, cacheConf=blockCache=LruBlockCache{blockCount=0, currentSize=1567264, freeSize=1525578848, maxSize=1527146112, heapSize=1567264, minSize=1450788736, minFactor=0.95, multiSize=725394368, multiFactor=0.5, singleSize=362697184, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=aaa/testFamily:testFamily/1473463633859/Put, lastKey=zzz/testFamily:testFamily/1473463634271/Put, avgKeyLen=35, avgValueLen=3, entries=17576, length=866998, cur=/testFamily:/OLDEST_TIMESTAMP/Minimum/vlen=0/seqid=0] to key /testFamily:testFamily/LATEST_TIMESTAMP/Maximum/vlen=0/seqid=0
      2016-09-09 16:27:15,635 DEBUG [B.fifo.QRpcServer.handler=5,queue=0,port=51833] ipc.CallRunner(110): B.fifo.QRpcServer.handler=5,queue=0,port=51833: callId: 26 service: ClientService methodName: Scan size: 26 connection: 192.168.42.75:51903
      java.io.IOException: Could not reseek StoreFileScanner[HFileScanner for reader reader=hdfs://localhost:51795/user/enis/test-data/d6fb1c70-93c1-4099-acb7-5723fc05a737/data/default/testScanThrowsException/b2adfb618e5d0fe225c1dc40c0eabfee/testFamily/5a213cc23b714e5e8e1a140ebbe72f2c, compression=none, cacheConf=blockCache=LruBlockCache{blockCount=0, currentSize=1567264, freeSize=1525578848, maxSize=1527146112, heapSize=1567264, minSize=1450788736, minFactor=0.95, multiSize=725394368, multiFactor=0.5, singleSize=362697184, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=aaa/testFamily:testFamily/1473463633859/Put, lastKey=zzz/testFamily:testFamily/1473463634271/Put, avgKeyLen=35, avgValueLen=3, entries=17576, length=866998, cur=/testFamily:/OLDEST_TIMESTAMP/Minimum/vlen=0/seqid=0] to key /testFamily:testFamily/LATEST_TIMESTAMP/Maximum/vlen=0/seqid=0
      	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:224)
      	at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
      	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:312)
      	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
      	at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:815)
      	at org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:803)
      	at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:625)
      	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:147)
      	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5736)
      	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5887)
      	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5674)
      	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2434)
      	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32295)
      	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2127)
      	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
      	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
      	at org.apache.hadoop.hbase.ipc.RpcExecutor$2.run(RpcExecutor.java:108)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.IOException: foo bar baz
      	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:207)
      	... 17 more
      2016-09-09 16:27:15,653 INFO  [hconnection-0x71ad3d8a-shared--pool21-t10] client.ScannerCallable(261): Got exception making request scanner_id: 1 number_of_rows: 100 close_scanner: false next_call_seq: 0 client_handles_partials: true client_handles_heartbeats: true renew: false to region=testScanThrowsException,,1473463632707.b2adfb618e5d0fe225c1dc40c0eabfee., hostname=hw10676,51833,1473463626529, seqNum=2
      java.io.IOException: java.io.IOException: Could not reseek StoreFileScanner[HFileScanner for reader reader=hdfs://localhost:51795/user/enis/test-data/d6fb1c70-93c1-4099-acb7-5723fc05a737/data/default/testScanThrowsException/b2adfb618e5d0fe225c1dc40c0eabfee/testFamily/5a213cc23b714e5e8e1a140ebbe72f2c, compression=none, cacheConf=blockCache=LruBlockCache{blockCount=0, currentSize=1567264, freeSize=1525578848, maxSize=1527146112, heapSize=1567264, minSize=1450788736, minFactor=0.95, multiSize=725394368, multiFactor=0.5, singleSize=362697184, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=aaa/testFamily:testFamily/1473463633859/Put, lastKey=zzz/testFamily:testFamily/1473463634271/Put, avgKeyLen=35, avgValueLen=3, entries=17576, length=866998, cur=/testFamily:/OLDEST_TIMESTAMP/Minimum/vlen=0/seqid=0] to key /testFamily:testFamily/LATEST_TIMESTAMP/Maximum/vlen=0/seqid=0
      	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:224)
      	at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
      	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:312)
      	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
      	at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:815)
      	at org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:803)
      	at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:625)
      	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:147)
      	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5736)
      	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5887)
      	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5674)
      	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2434)
      	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32295)
      	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2127)
      	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
      	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
      	at org.apache.hadoop.hbase.ipc.RpcExecutor$2.run(RpcExecutor.java:108)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.IOException: foo bar baz
      	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:207)
      	... 17 more
      
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
      	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
      	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
      	at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:332)
      	at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:256)
      	at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:1)
      	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
      	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:364)
      	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:1)
      	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
      	at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.IOException): java.io.IOException: Could not reseek StoreFileScanner[HFileScanner for reader reader=hdfs://localhost:51795/user/enis/test-data/d6fb1c70-93c1-4099-acb7-5723fc05a737/data/default/testScanThrowsException/b2adfb618e5d0fe225c1dc40c0eabfee/testFamily/5a213cc23b714e5e8e1a140ebbe72f2c, compression=none, cacheConf=blockCache=LruBlockCache{blockCount=0, currentSize=1567264, freeSize=1525578848, maxSize=1527146112, heapSize=1567264, minSize=1450788736, minFactor=0.95, multiSize=725394368, multiFactor=0.5, singleSize=362697184, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=aaa/testFamily:testFamily/1473463633859/Put, lastKey=zzz/testFamily:testFamily/1473463634271/Put, avgKeyLen=35, avgValueLen=3, entries=17576, length=866998, cur=/testFamily:/OLDEST_TIMESTAMP/Minimum/vlen=0/seqid=0] to key /testFamily:testFamily/LATEST_TIMESTAMP/Maximum/vlen=0/seqid=0
      	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:224)
      	at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
      	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:312)
      	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
      	at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:815)
      	at org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:803)
      	at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:625)
      	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:147)
      	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5736)
      	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5887)
      	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5674)
      	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2434)
      	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32295)
      	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2127)
      	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
      	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
      	at org.apache.hadoop.hbase.ipc.RpcExecutor$2.run(RpcExecutor.java:108)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.IOException: foo bar baz
      	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:207)
      	... 17 more
      
      	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1225)
      	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
      	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
      	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32741)
      	at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:214)
      	... 9 more
      2016-09-09 16:27:15,963 INFO  [B.priority.fifo.QRpcServer.handler=8,queue=0,port=51824] regionserver.RSRpcServices(2196): scan request:region { type: REGION_NAME value: "hbase:meta,,1" } scan { start_row: "testScanThrowsException,,99999999999999" max_versions: 1 cache_blocks: true small: true reversed: true caching: 1 } number_of_rows: 1 close_scanner: true client_handles_partials: true client_handles_heartbeats: true
      2016-09-09 16:27:15,965 INFO  [B.priority.fifo.QRpcServer.handler=8,queue=0,port=51824] regionserver.RSRpcServices(2549): class org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanResponse:cells_per_result: 4 scanner_id: 15 more_results: false stale: false partial_flag_per_result: false more_results_in_region: true heartbeat_message: false
      2016-09-09 16:27:15,967 INFO  [B.fifo.QRpcServer.handler=0,queue=0,port=51833] regionserver.RSRpcServices(2196): scan request:scanner_id: 1 number_of_rows: 100 close_scanner: false next_call_seq: 0 client_handles_partials: true client_handles_heartbeats: true renew: false
      2016-09-09 16:27:15,967 INFO  [B.fifo.QRpcServer.handler=0,queue=0,port=51833] regionserver.RSRpcServices(2549): class org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanResponse:scanner_id: 1 more_results: true ttl: 60000 stale: false more_results_in_region: false
      2016-09-09 16:27:15,968 INFO  [B.fifo.QRpcServer.handler=1,queue=0,port=51833] regionserver.RSRpcServices(2196): scan request:scanner_id: 1 number_of_rows: 0 close_scanner: true client_handles_partials: true client_handles_heartbeats: true
      2016-09-09 16:27:15,968 INFO  [B.fifo.QRpcServer.handler=1,queue=0,port=51833] regionserver.RSRpcServices(2549): class org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanResponse:scanner_id: 1 more_results: false
      2016-09-09 16:27:15,971 INFO  [main] hbase.HBaseTestingUtility(1111): Shutting down minicluster
      

      The ScannerCallable is retryable if it gets a IOException. Before the HBASE-13662 fix, the retry was still failing due to OutOfOrderScannerException, which then causes the Scanner stack to be reset, thus masking the issue. After HBASE-13662, we can retry the same scan, but in case of we are doing a reseek() which fails in middle with an IOException, the KVHeaps and the Scanner state on the server side is left dirty.

      It seems to be much safer to reset the scanner state in case of IOExceptions.

      Attachments

        1. HBASE-16604-branch-1.3-addendum.patch
          0.9 kB
          Jerry He
        2. hbase-16604_v3.branch-1.patch
          32 kB
          Enis Soztutar
        3. hbase-16604_v3.patch
          33 kB
          Enis Soztutar
        4. hbase-16604_v2.patch
          33 kB
          Enis Soztutar
        5. hbase-16604_v1.patch
          29 kB
          Enis Soztutar

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            enis Enis Soztutar
            enis Enis Soztutar
            Votes:
            0 Vote for this issue
            Watchers:
            11 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment