Hadoop HDFS / HDFS-17608

Datanode decommissioning hangs forever if the node under decommissioning has a disk media error


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.6
    • Fix Version/s: None
    • Component/s: datanode
    • Labels: None
    • Environment: Redhat 8.7, Hadoop 3.3.6

    Description

      The blocks on the decommissioning datanode are all EC striped blocks. The decommissioning process hangs forever and keeps outputting these logs:

       

      2024-08-26 10:31:14,748 WARN  datanode.DataNode (DataNode.java:run(2927)) - DatanodeRegistration(10.18.130.251:1019, datanodeUuid=a9e27f77-eb6e-46df-ad4c-b5daf2bf9508, infoPort=1022, infoSecurePort=0, ipcPort=8010, storageInfo=lv=-57;cid=CID-75a4da17-d28b-4820-b781-7c9f8dced67f;nsid=2079136093;c=1692354715862):Failed to transfer BP-184818459-10.18.130.160-1692354715862:blk_-9223372036501683307_35436379 to x.x.x.x:1019 got
      java.io.IOException: Input/output error
              at java.io.FileInputStream.readBytes(Native Method)
              at java.io.FileInputStream.read(FileInputStream.java:255)
              at org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:881)
              at java.io.FilterInputStream.read(FilterInputStream.java:133)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
              at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
              at java.io.DataInputStream.read(DataInputStream.java:149)
              at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:215)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90)
              at org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:691)
              at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:578)
              at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:816)
              at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:763)
              at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2900)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:750)
      2024-08-26 10:31:14,758 WARN  datanode.DataNode (BlockSender.java:readChecksum(693)) -  Could not read or failed to verify checksum for data at offset 10878976 for block BP-184818459-x.x.x.x-1692354715862:blk_-9223372036827280880_3990731
      java.io.IOException: Input/output error
              at java.io.FileInputStream.readBytes(Native Method)
              at java.io.FileInputStream.read(FileInputStream.java:255)
              at org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:881)
              at java.io.FilterInputStream.read(FilterInputStream.java:133)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
              at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
              at java.io.DataInputStream.read(DataInputStream.java:149)
              at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:215)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90)
              at org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:691)
              at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:578)
              at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:816)
              at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:763)
              at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2900)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:750)
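
      As a triage sketch, the failing block can be mapped back to its file and current replica locations with fsck's block lookup (for an EC internal block the lookup may need the block group ID rather than the internal block ID, depending on version):

        # Map the failing block back to its file and current replica locations
        hdfs fsck / -blockId blk_-9223372036501683307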

      The namenode outputs:

      2024-08-26 10:39:13,404 INFO  BlockStateChange (DatanodeAdminManager.java:logBlockReplicationInfo(373)) - Block: blk_-9223372036823520640_4252147, Expected Replicas: 9, live replicas: 8, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, replicas on stale nodes: 0, readonly replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 10.18.130.68:1019 10.18.130.52:1019 10.18.129.137:1019 10.18.130.65:1019 10.18.129.150:1019 10.18.130.58:1019 10.18.137.12:1019 10.18.130.251:1019 10.18.129.171:1019 , Current Datanode: 10.18.130.251:1019, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
      2024-08-26 10:39:13,404 INFO  blockmanagement.DatanodeAdminDefaultMonitor (DatanodeAdminDefaultMonitor.java:check(305)) - Node 10.18.130.251:1019 still has 3 blocks to replicate before it is a candidate to finish Decommission In Progress.
      2024-08-26 10:39:13,404 INFO  blockmanagement.DatanodeAdminDefaultMonitor (DatanodeAdminDefaultMonitor.java:run(188)) - Checked 3 blocks and 1 nodes this tick. 1 nodes are now in maintenance or transitioning state. 0 nodes pending.
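
      For reference, the stuck node and its remaining under-replicated block count also show up in the admin report; this is just a convenience check, not a fix:

        # List only nodes in DECOMMISSION_IN_PROGRESS, with the block counts
        # that are still holding up the transition to DECOMMISSIONED
        hdfs dfsadmin -report -decommissioning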

      The block (blk_-9223372036501683307_35436379) that the datanode is trying to access is on a disk that has a media error. dmesg keeps reporting:

      [Mon Aug 26 10:41:28 2024] blk_update_request: I/O error, dev sdk, sector 12816298864 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
      [Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#489 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
      [Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#491 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
      [Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
      [Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
      [Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 Sense Key : Medium Error [current] 
      [Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 Add. Sense: No additional sense information
      [Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 CDB: Read(16) 88 00 00 00 00 03 06 09 e3 b0 00 00 00 08 00 00
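
      As a sanity check outside of HDFS, the failing sector reported by dmesg can be read directly (read-only commands; device and sector are taken from the messages above, adjust as needed):

        # Direct (uncached) read of the 4 KiB region dmesg complains about;
        # an "Input/output error" here confirms the medium error on sdk
        dd if=/dev/sdk of=/dev/null bs=512 skip=12816298864 count=8 iflag=direct

        # Check the drive's SMART status / error log
        smartctl -a /dev/sdk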

       

      When I try to cp this block file, I get an error:

        cp  blk_-9223372036827280880_3990731.meta /opt
        cp: error reading 'blk_-9223372036827280880_3990731.meta': Input/output error
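
      The same read failure also reproduces through the DataNode's own I/O path with the debug tool (the paths below are placeholders for the replica's block and meta files under the affected volume):

        # Re-read the replica and verify its checksums via HDFS itself; on this
        # disk it fails with the same Input/output error instead of completing
        hdfs debug verifyMeta -meta <path-on-sdk>/blk_-9223372036827280880_3990731.meta \
            -block <path-on-sdk>/blk_-9223372036827280880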


          People

            Assignee: Unassigned
            Reporter: Jack Yang (jacklove2run)
