Hadoop HDFS / HDFS-6137

Datanode cannot rollback because LayoutVersion incorrect

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.5-alpha
    • Fix Version/s: None
    • Component/s: datanode
    • Labels:
      None

      Description

      Upgraded from hadoop-2.0.5-alpha (QJM HA enabled) to the latest trunk (HA disabled), which was successful. Then stopped the cluster and rolled back, and the datanode threw this exception:

      2014-03-21 18:33:19,384 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1123524590-10.204.8.135-1395397158134 (storage id DS-1123524590-10.204.8.135-50010-1395397185148) service to 10-204-8-135/10.204.8.135:9000
      org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected version of storage directory /data/hdfs/data/current/BP-1123524590-10.204.8.135-1395397158134. Reported: -55. Expecting = -40.
              at org.apache.hadoop.hdfs.server.common.Storage.setLayoutVersion(Storage.java:1083)
              at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.setFieldsFromProperties(BlockPoolSliceStorage.java:217)
              at org.apache.hadoop.hdfs.server.common.Storage.readProperties(Storage.java:922)
              at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:244)
              at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:145)
              at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:234)
              at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:913)
              at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:884)
              at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
              at java.lang.Thread.run(Thread.java:744)
      

      I looked at the datanode dir: $datanode.dir/VERSION always holds the new version. This file was overwritten during the upgrade, so the rollback MUST fail.
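The two numbers in the exception can be read with a small sketch. It assumes that HDFS layout versions are negative integers that decrease as the on-disk format evolves, so a stored value lower than the value the running software supports indicates a newer format. The class and method names are hypothetical and are not Hadoop APIs.

```java
/** Hypothetical illustration of the version comparison; not Hadoop code. */
final class LayoutVersionCheck {

  /**
   * Assumption: layout versions are negative and decrease with each format
   * change, so a stored value below the supported one was written by newer software.
   */
  static boolean isNewerThanSupported(int storedVersion, int supportedVersion) {
    return storedVersion < supportedVersion;
  }

  public static void main(String[] args) {
    int reported = -55; // "Reported" value from the log above
    int expected = -40; // "Expecting" value from the log above
    if (isNewerThanSupported(reported, expected)) {
      System.out.println("Stored layout " + reported
          + " is newer than supported layout " + expected);
    }
  }
}
```

Under that assumption, the rolled-back 2.0.5-alpha datanode sees a directory still marked with the trunk layout and refuses to start.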

          Activity

          Tsz Wo Nicholas Sze added a comment -

          > but cannot change the (old) software to fix the bug.

          > What did you mean?

          In this case, the old software we tried to roll back to is hadoop-2.0.5-alpha. The bug is actually in hadoop-2.0.5-alpha, not in trunk.

          Fengdong Yu added a comment -

          Thanks Tsz Wo Nicholas Sze, a very useful description.

          > I think we can only advise users to roll back manually (manually change the data/current/VERSION file to the old version)

          Yes, but for a large cluster that requires some additional work.

          > but cannot change the (old) software to fix the bug.

          What did you mean?

          Arpit Agarwal added a comment -

          Thanks for explaining that, Nicholas. That makes sense.

          Tsz Wo Nicholas Sze added a comment -

          > I think this is fixed in trunk now after HDFS-6800 and HDFS-6981.

          This problem is probably different from HDFS-6800 and HDFS-6981.

          Let me give some background. Federation added block pools to the datanode data directory. The directory structure becomes

          data +- current  +- pool_1 +- current
               |           |         +- previous
               |           |        
               |           +- pool_2 +- current
               |                     +- previous
               |
               +- previous
          

          Then we have two levels of VERSION files: data/current/VERSION and data/current/pool_x/current/VERSION. During upgrade, both VERSION files are overwritten with the new versions. For rollback, since we may roll back only an individual block pool, only data/current/pool_x/current/VERSION is restored, not data/current/VERSION. We then get a version mismatch.
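The mismatch described above can be checked by hand with a short sketch. It assumes that both VERSION files are in Java properties format and carry a `layoutVersion` key, and that the two values normally agree. The class and method names are hypothetical and are not Hadoop APIs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

/** Hypothetical helper, not part of Hadoop: compares layoutVersion across two VERSION files. */
final class VersionFileCheck {

  /** Reads the layoutVersion property (assumed key name) from a VERSION file. */
  static int readLayoutVersion(Path versionFile) throws IOException {
    Properties props = new Properties();
    try (InputStream in = Files.newInputStream(versionFile)) {
      props.load(in);
    }
    String value = props.getProperty("layoutVersion");
    if (value == null) {
      throw new IOException("No layoutVersion in " + versionFile);
    }
    return Integer.parseInt(value.trim());
  }

  /** Returns true when the two VERSION files report different layout versions. */
  static boolean isMismatched(Path dataVersion, Path poolVersion) throws IOException {
    return readLayoutVersion(dataVersion) != readLayoutVersion(poolVersion);
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("usage: VersionFileCheck <data VERSION> <block pool VERSION>");
      return;
    }
    Path data = Paths.get(args[0]);
    Path pool = Paths.get(args[1]);
    System.out.println("data layoutVersion = " + readLayoutVersion(data));
    System.out.println("pool layoutVersion = " + readLayoutVersion(pool));
    System.out.println(isMismatched(data, pool) ? "MISMATCH" : "consistent");
  }
}
```

It would be run with the paths of data/current/VERSION and the block pool's current/VERSION as its two arguments.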

          We found the problem in HDFS-5526. At that time we added code to overwrite the data/current/VERSION file during rollback. It worked fine.

          However, software versions with Federation but without HDFS-5526 still have the problem, so they cannot roll back. This is the bug described here.

          I think we can only advise users to roll back manually (manually change the data/current/VERSION file to the old version), but cannot change the (old) software to fix the bug.
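The manual change suggested above might look like the following sketch. It assumes that the VERSION file is in Java properties format and that only its `layoutVersion` value has to be set back. Whether other fields also differ between versions is not stated in this issue. The old layout version must be supplied by the operator, the datanode should be stopped first, and the class and method names are hypothetical and are not Hadoop APIs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Properties;

/** Hypothetical helper, not part of Hadoop: rewrites layoutVersion in a VERSION file. */
final class ManualVersionRollback {

  /** Backs up the file as VERSION.bak, then stores it again with the old layoutVersion. */
  static void setLayoutVersion(Path versionFile, int oldLayoutVersion) throws IOException {
    Properties props = new Properties();
    try (InputStream in = Files.newInputStream(versionFile)) {
      props.load(in);
    }
    // Keep a copy so the change can be undone by hand.
    Path backup = versionFile.resolveSibling(versionFile.getFileName() + ".bak");
    Files.copy(versionFile, backup, StandardCopyOption.REPLACE_EXISTING);

    // Assumption: layoutVersion is the only field that needs to change.
    props.setProperty("layoutVersion", Integer.toString(oldLayoutVersion));
    try (OutputStream out = Files.newOutputStream(versionFile)) {
      props.store(out, null);
    }
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("usage: ManualVersionRollback <path to VERSION> <old layoutVersion>");
      return;
    }
    setLayoutVersion(Paths.get(args[0]), Integer.parseInt(args[1]));
  }
}
```

On a large cluster this would have to be repeated for every data directory on every datanode, which is the additional work raised elsewhere in this thread.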

          Tsz Wo Nicholas Sze made changes -
          Affects Version/s 2.0.5-alpha [ 12324428 ]
          Affects Version/s 2.4.0 [ 12326143 ]
          Tsz Wo Nicholas Sze added a comment -

          > ..., the Affects Version/s should be 2.0.5 right? ...

          Sure. I have just updated it.

          Arpit Agarwal added a comment -

          I think this is fixed in trunk now after HDFS-6800 and HDFS-6981.

          Tsz Wo Nicholas Sze, could you please confirm?

          Suresh Srinivas added a comment -

          Tsz Wo Nicholas Sze, the Affects Version/s should be 2.0.5 right? This is a problem in the rollback code from 2.0.5, not from 2.4.0.

          Tsz Wo Nicholas Sze added a comment -

          I think it probably called doRollback(..) first and then readProperties(..), as shown below.

          // BlockPoolSliceStorage.doTransition(..)
              if (startOpt == StartupOption.ROLLBACK) {
                doRollback(sd, nsInfo); // rollback if applicable
              } else {
                // Restore all the files in the trash. The restored files are retained
                // during rolling upgrade rollback. They are deleted during rolling
                // upgrade downgrade.
                int restored = restoreBlockFilesFromTrash(getTrashRootDir(sd));
                LOG.info("Restored " + restored + " block files from trash.");
              }
              readProperties(sd);
          
          Fengdong Yu added a comment -

          Tsz Wo Nicholas Sze, judging from the stack trace of the exception, BlockPoolSliceStorage.doRollback() is not called when the DN starts with -rollback.

                  at org.apache.hadoop.hdfs.server.common.Storage.readProperties(Storage.java:922)
                  at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:244)
                  at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:145)
          

          doTransition() should call doRollback(), right?

          Tsz Wo Nicholas Sze added a comment -

          I found this problem earlier and filed HDFS-5526. It seems that HDFS-5526 did not completely solve the problem.

          Tsz Wo Nicholas Sze made changes -
          Link This issue relates to HDFS-5526 [ HDFS-5526 ]
          Fengdong Yu made changes -
          Description: "(HA enabled)" in the first sentence changed to "(QJM HA enabled)"; the rest of the text, including the log and stack trace, is the same as in the Description above.
          Fengdong Yu made changes -
          Description: "{datanode.dir}/VERSION" in the last sentence changed to "$datanode.dir/VERSION"; the rest of the text is unchanged.
          Fengdong Yu made changes -
          Description: same log and stack trace as in the Description above, with "(HA enabled)" in the first sentence and "{datanode.dir}/VERSION" in the last sentence.
          Fengdong Yu created issue -

            People

            • Assignee:
              Unassigned
              Reporter:
              Fengdong Yu
            • Votes:
              0
              Watchers:
              7
