Hadoop HDFS - HDFS-15732

EC client does not retry fetching the block token when the block token has expired in a Kerberized cluster



    Description

      After enabling an EC policy on an HBase cluster we hit the following issue. Once a block token has expired on the DataNode side, the client cannot identify the InvalidToken error because it is wrapped inside the SASL negotiation failure. As a result, the EC client does not retry by refetching the token when it creates the block reader. The peer DataNode is then added to DeadNodes, and every subsequent call to createBlockReader against this DataNode in the current DFSStripedInputStream treats the DataNode as dead and returns false. The final result is a read failure.
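      For comparison, the retry we would expect is roughly the following. This is a minimal, hypothetical sketch in Java, not the actual DFSStripedInputStream code: it assumes the SASL failure can be surfaced to the EC read path as an InvalidBlockTokenException (which, per the logs below, is exactly what does not happen today), and buildBlockReader() / refetchLocatedBlock() are illustrative placeholders.

      import java.io.IOException;
      import org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException;

      class StripedReaderRetrySketch {

        // Stand-in for the per-DataNode reader creation done in DFSStripedInputStream.
        boolean createBlockReader(Object block, Object datanode) throws IOException {
          final int maxTokenRefreshes = 1;  // assumption: one refreshed token is enough
          for (int attempt = 0; ; attempt++) {
            try {
              buildBlockReader(block, datanode);  // may fail during the SASL handshake
              return true;
            } catch (InvalidBlockTokenException e) {
              // Token expired: refetch the located block (and with it a fresh token)
              // from the NameNode instead of declaring the DataNode dead, then retry.
              if (attempt < maxTokenRefreshes) {
                block = refetchLocatedBlock(block);
                continue;
              }
              throw e;
            } catch (IOException e) {
              // Any other failure keeps today's behaviour: the caller treats the
              // DataNode as unusable for this read and adds it to DeadNodes.
              return false;
            }
          }
        }

        private void buildBlockReader(Object block, Object dn) throws IOException { /* placeholder */ }

        private Object refetchLocatedBlock(Object block) throws IOException { return block; }
      }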

      Some logs:

      HBase RegionServer log:

      2020-12-17 10:00:24,291 WARN [RpcServer.default.FPBQ.Fifo.handler=15,queue=0,port=16020] hdfs.DFSClient: Failed to connect to /10.65.19.41:9866 for block BP-1601568648-10.65.19.12-1550823043026:blk_-9223372036813273566_672859566
      java.io.IOException: DIGEST-MD5: IO error acquiring password
      at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:421)
      at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:479)
      at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:393)
      at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:267)
      at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:215)
      at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
      at org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:647)
      at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2936)
      at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:821)
      at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:746)
      at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
      at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:647)
      at org.apache.hadoop.hdfs.DFSStripedInputStream.createBlockReader(DFSStripedInputStream.java:272)
      at org.apache.hadoop.hdfs.StripeReader.readChunk(StripeReader.java:333)
      at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:365)
      at org.apache.hadoop.hdfs.DFSStripedInputStream.fetchBlockByteRange(DFSStripedInputStream.java:514)
      at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1354)
      at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1318)
      at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:92)
      at org.apache.hadoop.hbase.io.hfile.HFileBlock.positionalReadWithExtra(HFileBlock.java:808)
      at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1568)
      at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1772)
      at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1597)
      at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1496)
      at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$CellBasedKeyBlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:340)
      at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:856)
      at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:806)
      at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:327)
      at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:228)
      at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:395)
      at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:250)
      at org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2031)
      at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2022)
      at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.initializeScanners(HRegion.java:6408)
      at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:6388)
      at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2926)

       

       

      DataNode log:

      2020-12-17 10:00:24,290 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-btdn0005.eniot.io:9866:DataXceiver error processing unknown operation src: /10.65.19.42:53894 dst: /10.65.19.41:9866
      javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Block token with block_token_identifier (expiryDate=1608192045256, keyId=1449537289, userId=hbase, blockPoolId=BP-1601568648-10.65.19.12-1550823043026, blockId=-9223372036813273566, access modes=[READ], storageTypes= [DISK, DISK, DISK, DISK, DISK], storageIds= [DS-604a9eaf-94ba-4127-b70d-436361c49ecd, DS-b53bf503-84c1-4e9e-aede-a19f88e3fc9a, DS-a1ee0117-7430-4181-b279-3f1b16541567, DS-e8459904-e1e1-44a6-affa-9ce9fc56f877, DS-9f54936c-96cc-4c85-91e6-a6f8e00f246a]) is expired.]
      at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
      at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
      at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
      at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
      at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
      at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
      at java.lang.Thread.run(Thread.java:748)
      Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Block token with block_token_identifier (expiryDate=1608192045256, keyId=1449537289, userId=hbase, blockPoolId=BP-1601568648-10.65.19.12-1550823043026, blockId=-9223372036813273566, access modes=[READ], storageTypes= [DISK, DISK, DISK, DISK, DISK], storageIds= [DS-604a9eaf-94ba-4127-b70d-436361c49ecd, DS-b53bf503-84c1-4e9e-aede-a19f88e3fc9a, DS-a1ee0117-7430-4181-b279-3f1b16541567, DS-e8459904-e1e1-44a6-affa-9ce9fc56f877, DS-9f54936c-96cc-4c85-91e6-a6f8e00f246a]) is expired.
      at org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:479)
      at org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:81)
      at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318)
      at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73)
      at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297)
      at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241)
      at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
      ... 7 more

       

      Final result:

       

       

      java.io.IOException: Could not iterate StoreFileScanner[HFileScanner for reader reader=hdfs://azbeta/hbase/data/o15427722038191/RAW_569dfc35-10dc-4fe4-bb8c-5c403368d6c1/ccf146fb5c852d751b6b2f8307f81fa0/d/386bdb6978d348f4b4e168e728762842, compression=snappy, cacheConf=blockCache=LruBlockCache{blockCount=7915, currentSize=489.13 MB, freeSize=2.71 GB, maxSize=3.19 GB, heapSize=489.13 MB, minSize=3.03 GB, minFactor=0.95, multiSize=1.52 GB, multiFactor=0.5, singleSize=775.81 MB, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=Optional[\x01\x00\x00\x13\xBD\x00\x00\x00T\x070\x97\xF7/d:arm_xingneng_test_model@AI1/1603445985858/Put/seqid=0], lastKey=Optional[\x01\x00\x00\x16\xF9\x00\x00\x00T\x073-8/d:arm_xingneng_test_model@AI99/1603446144699/Put/seqid=0], avgKeyLen=53, avgValueLen=4, entries=2016557, length=22957603, cur=\x01\x00\x00\x14#\x00\x00\x00T\x072\xCC\xF7/d:arm_xingneng_test_model@AI84/1603446131146/Put/vlen=4/seqid=17230816]
      at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:217)
      at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:120)
      at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:653)
      at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:388)
      at org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:327)
      at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:65)
      at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:126)
      at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1410)
      at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:2187)
      at org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.doCompaction(CompactSplit.java:596)
      at org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.run(CompactSplit.java:638)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
      Caused by: java.io.IOException: 3 missing blocks, the stripe is: AlignedStripe(Offset=1048576, length=1048576, fetchedChunksNum=0, missingChunksNum=3); locatedBlocks is: LocatedBlocks{; fileLength=22957603; underConstruction=false; blocks=[LocatedStripedBlock{BP-1197414916-10.27.20.30-1535978156945:blk_-9223372036830849056_317784750; getBlockSize()=22957603; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.27.20.42:9866,DS-d0d5a7ce-8280-45fe-b910-1e5c7b579367,DISK], DatanodeInfoWithStorage[10.27.22.86:9866,DS-b018c729-66d5-4953-94f1-ec7664a46cb7,DISK], DatanodeInfoWithStorage[10.27.22.79:9866,DS-d79c402a-1845-4b3f-893e-f84d94085b2a,DISK], DatanodeInfoWithStorage[10.27.20.41:9866,DS-b07ba1e0-9a34-4caf-ad51-6d5f302c08ea,DISK], DatanodeInfoWithStorage[10.27.20.39:9866,DS-8a059d87-f122-4fed-a6a2-e45662692305,DISK]]; indices=[0, 1, 2, 3, 4]}]; lastLocatedBlock=LocatedStripedBlock{BP-1197414916-10.27.20.30-1535978156945:blk_-9223372036830849056_317784750; getBlockSize()=22957603; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.27.20.42:9866,DS-d0d5a7ce-8280-45fe-b910-1e5c7b579367,DISK], DatanodeInfoWithStorage[10.27.22.86:9866,DS-b018c729-66d5-4953-94f1-ec7664a46cb7,DISK], DatanodeInfoWithStorage[10.27.22.79:9866,DS-d79c402a-1845-4b3f-893e-f84d94085b2a,DISK], DatanodeInfoWithStorage[10.27.20.41:9866,DS-b07ba1e0-9a34-4caf-ad51-6d5f302c08ea,DISK], DatanodeInfoWithStorage[10.27.20.39:9866,DS-8a059d87-f122-4fed-a6a2-e45662692305,DISK]]; indices=[0, 1, 2, 3, 4]}; isLastBlockComplete=true; ecPolicy=ErasureCodingPolicy=[Name=RS-3-2-1024k, Schema=[ECSchema=[Codec=rs, numDataUnits=3, numParityUnits=2]], CellSize=1048576, Id=2]}
      at org.apache.hadoop.hdfs.StripeReader.checkMissingBlocks(StripeReader.java:177)
      at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:372)
      at org.apache.hadoop.hdfs.DFSStripedInputStream.readOneStripe(DFSStripedInputStream.java:324)
      at org.apache.hadoop.hdfs.DFSStripedInputStream.readWithStrategy(DFSStripedInputStream.java:397)
      at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:820)
      at java.io.DataInputStream.read(DataInputStream.java:149)
      at org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:765)
      at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1562)
      at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1772)
      at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1597)
      at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1496)
      at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readNextDataBlock(HFileReaderImpl.java:931)
      at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.isNextBlock(HFileReaderImpl.java:1064)
      at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.positionForNextBlock(HFileReaderImpl.java:1058)
      at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl._next(HFileReaderImpl.java:1076)
      at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.next(HFileReaderImpl.java:1097)
      at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:208)
      ... 13 more

            People

              Assignee: gaozhan ding
              Reporter: gaozhan ding