HBase / HBASE-21607

HBase Region server read fails in case of datanode disk error.


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.3.2
    • Fix Version/s: None
    • Component/s: regionserver
    • Labels: None

      Description

      HBase region server reads failed with the following error.

      2018-11-30 16:49:18,760 WARN [,queue=12,port=60020] hdfs.BlockReaderFactory - BlockReaderFactory(fileName=<fileName>, block=BP-1618467445-<ip>-1516873463430:blk_1090719164_16982933): error creating ShortCircuitReplica.
      java.io.IOException: invalid metadata header version 0. Can only handle version 1.
      at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.<init>(ShortCircuitReplica.java:129)
      at org.apache.hadoop.hdfs.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:558)
      at org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:490)
      at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:782)
      at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:716)
      at org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422)
      at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333)
      at org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1145)
      at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1087)
      at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1444)
      at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1407)
      at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:89)
      at org.apache.hadoop.hbase.io.hfile.HFileBlock.positionalReadWithExtra(HFileBlock.java:834)
      at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1530)
      at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1781)
      at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1624)
      at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:455)
      at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.seekTo(HFileReaderV2.java:1263)
      at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:297)
      at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:189)
      at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:372)
      at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:220)
      at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2164)
      at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.initializeScanners(HRegion.java:5916)
      at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:5890)
      at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2739)
      at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2719)
      at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7197)
      at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7156)
      at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7149)
      at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2250)
      at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:35068)
      at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2373)
      at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
      at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
      at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
      
       
      
      2018-11-30 16:49:18,760 WARN [,queue=12,port=60020] shortcircuit.ShortCircuitCache - ShortCircuitCache(0x16fd768f): failed to load 1090719164_BP-1618467445-<ip>-1516873463430
      2018-11-30 16:49:18,761 DEBUG [,queue=12,port=60020] ipc.RpcServer - RpcServer.FifoWFPBQ.default.handler=246,queue=12,port=60020: callId: 46940 service: ClientService methodName: Get size: 443 connection: <ip>:48798 deadline: 1543596678759
      2018-11-30 16:49:18,761 DEBUG [,queue=12,port=60020] ipc.RpcServer - RpcServer.FifoWFPBQ.default.handler=246,queue=12,port=60020: callId: 46940 service: ClientService methodName: Get size: 443 connection: <ip>:48798 deadline: 1543596678759
      java.io.IOException: Could not seek StoreFileScanner[HFileScanner for reader reader=hdfs://<fileName>, compression=none, cacheConf=blockCache=LruBlockCache{blockCount=696858, currentSize=7625195376, freeSize=5689203856, maxSize=13314399232, heapSize=7625195376, minSize=12648679424, minFactor=0.95, multiSize=6324339712, multiFactor=0.5, singleSize=3162169856, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=<keyName>
      at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:207)
      at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:372)
      at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:220)
      at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2164)
      at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.initializeScanners(HRegion.java:5916)
      at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:5890)
      at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2739)
      at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2719)
      at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7197)
      at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7156)
      at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7149)
      at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2250)
      at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:35068)
      at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2373)
      at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
      at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
      at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
      Caused by: java.io.IOException: Invalid HFile block magic: \x00\x00\x00\x00\x00\x00\x00\x00
      at org.apache.hadoop.hbase.io.hfile.BlockType.parse(BlockType.java:155)
      at org.apache.hadoop.hbase.io.hfile.BlockType.read(BlockType.java:167)
      at org.apache.hadoop.hbase.io.hfile.HFileBlock.<init>(HFileBlock.java:345)
      at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1821)
      at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1624)
      at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:455)
      at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.seekTo(HFileReaderV2.java:1263)
      at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:297)
      at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:189)
      ... 16 more
      
      
      

      TheĀ "Invalid HFile block magic" exception was thrown all the way back to the client.

      The first stack trace indicates that the HDFS short-circuit read failed; the client then fell back to a regular HDFS RPC read from the local datanode, and that succeeded.
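
      For context, short-circuit reads are the HDFS client feature seen failing in that first trace: when the ShortCircuitReplica cannot be created, the client retries through the regular datanode read path. Below is a minimal sketch of the standard client settings involved; they are normally set in hdfs-site.xml rather than in code, and the socket path here is just an example value.

      {code:java}
      import org.apache.hadoop.conf.Configuration;

      public class ShortCircuitReadConfig {
        public static void main(String[] args) {
          Configuration conf = new Configuration();
          // Short-circuit reads let a client co-located with the datanode read
          // block files directly via file descriptors passed over a domain
          // socket, instead of streaming through the datanode's RPC path.
          conf.setBoolean("dfs.client.read.shortcircuit", true);
          // The domain socket path is required for file-descriptor passing
          // (example value; site-specific).
          conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
          System.out.println("short-circuit reads enabled: "
              + conf.getBoolean("dfs.client.read.shortcircuit", false));
        }
      }
      {code}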

      The underlying reason is a write error on one of the datanode's disks. The datanode was receiving data and mirroring it to the downstream nodes in the pipeline just fine, but was writing garbage (all \x00 bytes) to its local disk. The datanode had not yet detected this disk as bad.

      We have HBase checksums turned on and HDFS checksum verification turned off. While reading this erroneous block, the datanode therefore did no checksum validation of its own; it streamed the data as-is to the region server and relied on HBase checksum validation to catch the corruption. But HBase checksum validation has a special case: if the checksumType byte is 0, it is treated as ChecksumType.NULL, and checksum validation is skipped for that block entirely. Since the corrupted block was all zeros, its checksumType byte was also 0, so the garbage passed validation and only failed later, when the block magic was parsed. Refer to ChecksumUtil and ChecksumType.
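
      To make that failure mode concrete, here is a minimal, self-contained sketch of the special case. The class, method names, and header offset below are illustrative, not the actual HBase source; the real logic lives in ChecksumUtil and ChecksumType.

      {code:java}
      public class ChecksumSkipSketch {

        // Mirrors the idea of org.apache.hadoop.hbase.util.ChecksumType:
        // the code byte 0 means "no checksum".
        enum ChecksumType {
          NULL((byte) 0), CRC32((byte) 1), CRC32C((byte) 2);

          final byte code;
          ChecksumType(byte code) { this.code = code; }

          static ChecksumType codeToType(byte b) {
            for (ChecksumType t : values()) {
              if (t.code == b) { return t; }
            }
            throw new IllegalArgumentException("Unknown checksum type code: " + b);
          }
        }

        // Illustrative offset of the checksum-type byte within the block header.
        static final int CHECKSUM_TYPE_OFFSET = 24;

        /** Returns true if the block is considered valid. */
        static boolean validateBlockChecksum(byte[] onDiskBlock) {
          ChecksumType type = ChecksumType.codeToType(onDiskBlock[CHECKSUM_TYPE_OFFSET]);
          if (type == ChecksumType.NULL) {
            // The special case: "no checksum" means validation is skipped.
            // A block of all \x00 bytes has a checksum-type byte of 0, so a
            // corrupted all-zero block sails through here and only fails
            // later, when the block magic is parsed.
            return true;
          }
          return verifyCrc(onDiskBlock, type);
        }

        static boolean verifyCrc(byte[] block, ChecksumType type) {
          // Real CRC32/CRC32C verification over the block's data chunks
          // would go here.
          return true;
        }

        public static void main(String[] args) {
          byte[] corrupted = new byte[64]; // all zeros, like the garbage on disk
          System.out.println(validateBlockChecksum(corrupted)); // prints: true
        }
      }
      {code}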
      Thanks to [~lhofhansl] and Karan Mehta for helping debug this.

    People

    • Assignee: Unassigned
    • Reporter: Rushabh Shah (shahrs87)
    • Votes: 0
    • Watchers: 9
