Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-14311

Multi-threading conflict at layoutVersion when loading block pool storage

VotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      When DataNode upgrade from 2.7.3 to 2.9.2, there is a conflict at StorageInfo.layoutVersion in loading block pool storage process.

      It will cause this exception:

       

      exceptions

      2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] - Restored 36974 block files from trash before the layout upgrade. These blocks will be moved to the previous directory during the upgrade
      2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] - Failed to analyze storage directories for block pool BP-1216718839-10.120.232.23-1548736842023
      java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the namespace state: LV = -63 CTime = 0
      at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
      at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
      at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
      at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
      at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
      at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
      at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
      at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
      at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
      at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
      at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
      at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
      at java.lang.Thread.run(Thread.java:748)
      2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block pool BP-1216718839-10.120.232.23-1548736842023
      java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the namespace state: LV = -63 CTime = 0
      at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
      at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
      at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
      at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
      at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
      at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
      at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
      at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
      at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
      at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
      at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
      at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
      at java.lang.Thread.run(Thread.java:748)

       

      root cause:

      BlockPoolSliceStorage instance is shared for all storage locations recover transition. In BlockPoolSliceStorage.doTransition, it will read the old layoutVersion from local storage, compare with current DataNode version, then do upgrade. In doUpgrade, add the transition work as a sub-thread, the transition work will set the BlockPoolSliceStorage's layoutVersion to current DN version. The next storage dir transition check will concurrent with pre storage dir real transition work, then the BlockPoolSliceStorage instance layoutVersion will confusion.

       

      Attachments

        1. HDFS-14311.1.patch
          2 kB
          Yicong Cai
        2. HDFS-14311.2.patch
          2 kB
          Yicong Cai
        3. HDFS-14311.branch-2.1.patch
          3 kB
          Yicong Cai

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            caiyicong Yicong Cai
            caiyicong Yicong Cai
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment