Hadoop HDFS: HDFS-8341

HDFS mover stuck in loop trying to move corrupt block with no other valid replicas, doesn't move the rest of the data blocks


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: balancer & mover
    • Labels: None
    • Environment: HDP 2.2

    Description

      HDFS mover gets stuck looping on a block that fails to move and doesn't migrate the rest of the blocks.

      This is preventing recovery of data from a decommissioning external storage tier used for archive. We've had problems with that proprietary "hyperscale" storage product, which is why a couple of blocks here and there have checksum problems or a premature EOF (as in the output below). That should not, however, stop the mover from migrating all the other blocks so we can recover our data:

      hdfs mover -p /apps/hive/warehouse/<custom_scrubbed>
      15/05/07 14:52:50 INFO mover.Mover: namenodes = {hdfs://nameservice1=[/apps/hive/warehouse/<custom_scrubbed>]}
      15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
      15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
      15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
      15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
      15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
      15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
      15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
      15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
      15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
      15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
      15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move blk_1075156654_1438349 with size=134217728 from <ip>:1019:ARCHIVE to <ip>:1019:DISK through <ip>:1019: block move is failed: opReplaceBlock BP-120244285-<ip>-1417023863606:blk_1075156654_1438349 received exception java.io.EOFException: Premature EOF: no length prefix available
      <NOW IT STARTS LOOPING ON SAME BLOCK>
      15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
      15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
      15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
      15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
      15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
      15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
      15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move blk_1075156654_1438349 with size=134217728 from <ip>:1019:ARCHIVE to <ip>:1019:DISK through <ip>:1019: block move is failed: opReplaceBlock BP-120244285-<ip>-1417023863606:blk_1075156654_1438349 received exception java.io.EOFException: Premature EOF: no length prefix available
      ...<repeat indefinitely>...
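      To confirm that the mover is retrying the same block rather than making progress, the failed-move warnings can be tallied per block ID. This is a minimal sketch, assuming the mover's console output has been saved to a file (for example with `hdfs mover -p <path> 2>&1 | tee mover.log`) and that the WARN lines keep the format shown in the log above; the file name `mover.log` and the function name `count_failed_moves` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count "Failed to move" warnings per block ID in saved
# mover output. Assumes the Dispatcher WARN line format shown in this issue.
set -eu

count_failed_moves() {
  local log="$1"
  # Keep only failed-move warnings, then take the block ID that directly
  # follows "Failed to move" (the ID also appears later in the same line,
  # after the block pool ID, so a bare blk_ match would count it twice).
  grep 'WARN balancer.Dispatcher: Failed to move' "$log" \
    | grep -oE 'Failed to move blk_[0-9]+_[0-9]+' \
    | awk '{print $4}' \
    | sort | uniq -c | sort -rn   # occurrences per block, most-repeated first
}

# Usage (hypothetical file name):
#   count_failed_moves mover.log
```

      A single block ID whose count keeps growing while no other block IDs appear, as with blk_1075156654_1438349 here, is the looping behavior described in this issue.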
      

          People

            Assignee: Unassigned
            Reporter: Hari Sekhon (harisekhon)