  1. Hadoop HDFS
  2. HDFS-14311

multi-threading conflict at layoutVersion when loading block pool storage

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.9.2
    • Fix Version/s: None
    • Component/s: rolling upgrades
    • Labels:
      None

      Description

      When a DataNode is upgraded from 2.7.3 to 2.9.2, a conflict occurs on StorageInfo.layoutVersion while the block pool storage is being loaded.

      This causes the following exception:

      2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] - Restored 36974 block files from trash before the layout upgrade. These blocks will be moved to the previous directory during the upgrade
      2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] - Failed to analyze storage directories for block pool BP-1216718839-10.120.232.23-1548736842023
      java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the namespace state: LV = -63 CTime = 0
      at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
      at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
      at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
      at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
      at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
      at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
      at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
      at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
      at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
      at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
      at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
      at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
      at java.lang.Thread.run(Thread.java:748)
      2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block pool BP-1216718839-10.120.232.23-1548736842023
      java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the namespace state: LV = -63 CTime = 0
      at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
      at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
      at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
      at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
      at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
      at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
      at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
      at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
      at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
      at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
      at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
      at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
      at java.lang.Thread.run(Thread.java:748)

       

      Root cause:

      A single BlockPoolSliceStorage instance is shared by the recover-transition work for all storage locations. For each storage directory, BlockPoolSliceStorage.doTransition reads the old layoutVersion from local storage, compares it with the current DataNode version, and then performs the upgrade. doUpgrade adds the transition work as a sub-thread, and that work sets the BlockPoolSliceStorage's layoutVersion to the current DataNode version. The transition check for the next storage directory therefore runs concurrently with the real transition work of the previous storage directory, and the layoutVersion of the shared BlockPoolSliceStorage instance becomes inconsistent.
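
The interleaving described above can be reproduced in a small stand-alone model. This is only a sketch under stated assumptions, not the HDFS code and not the attached patch: the class and method names, and the old layout version -56, are hypothetical. A latch forces the ordering that happens by chance in the DataNode, so the outcome is deterministic. The second method shows one possible way to avoid the conflict, by keeping each directory's on-disk version in a local variable instead of the shared field.

```java
import java.util.concurrent.CountDownLatch;

// Simplified model of the race; every name here is hypothetical.
final class LayoutVersionRaceSketch {
    static final int OLD_LV = -56;      // assumed on-disk version before the upgrade
    static final int CURRENT_LV = -57;  // DataNode-side version reported in the log

    // Stand-in for the one storage object shared by all storage directories.
    static final class SharedStorage {
        volatile int layoutVersion;
    }

    // Returns the version the check for directory 2 ends up reading
    // when directory 1's upgrade work writes to the same shared field.
    static int versionSeenBySecondCheck() {
        SharedStorage storage = new SharedStorage();
        CountDownLatch secondDirLoaded = new CountDownLatch(1);

        // Directory 1: load the on-disk version and hand the upgrade to a sub-thread.
        storage.layoutVersion = OLD_LV;
        Thread upgrade = new Thread(() -> {
            try {
                secondDirLoaded.await();             // force the problematic ordering
                storage.layoutVersion = CURRENT_LV;  // upgrade work updates the shared field
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        upgrade.start();

        // Directory 2: load its on-disk version into the same shared field.
        storage.layoutVersion = OLD_LV;
        secondDirLoaded.countDown();
        join(upgrade);

        // The check for directory 2 now reads the value written for directory 1.
        return storage.layoutVersion;
    }

    // Same scenario, but directory 2 keeps its on-disk version in a local variable.
    static int versionSeenWithLocalCopy() {
        SharedStorage storage = new SharedStorage();
        storage.layoutVersion = OLD_LV;
        Thread upgrade = new Thread(() -> storage.layoutVersion = CURRENT_LV);
        upgrade.start();

        int secondDirVersion = OLD_LV;  // never touched by the upgrade thread
        join(upgrade);
        return secondDirVersion;
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("shared field: " + versionSeenBySecondCheck());
        System.out.println("local copy:   " + versionSeenWithLocalCopy());
    }
}
```

In this model the check for the second directory no longer sees an old layout version, so it would not recognise that the directory still needs an upgrade, which is consistent with the "is newer than the namespace state" failure in the log above.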

       


            People

            • Assignee:
              Unassigned
              Reporter:
              caiyicong Yicong Cai