I have a more elegant work-around which doesn't involve deleting the data folders: edit the <hadoop-data-root>/dfs/data/current/VERSION file, changing the namespaceID to match the current namenode:
[jstehler@server19 ~]$ cat /lv_main/hadoop/dfs/data/current/VERSION
This allowed me to bring up the slave datanode and have it recognized by the namenode in the DFS UI.
I can confirm the bug: upgrading from 0.17 to 0.18.1 did not work. I even deleted the HDFS on the nodes and formatted (I'm running a small 9-node cluster):
$ bin/hadoop/stop-all && cluster-fork rm -rf /state/partition1/hdfs/hadoop/*
In addition, I tried a rough variant on Jared's solution that did not work either:
$ cp /state/partition1/hdfs/hadoop/dfs/data/current/VERSION /shared/apps/VERSION
Is there a reliable way to make it work right away? Can this VERSION file (or namespaceID) be forced to be equal on every node?
Have had the same problem with version 0.19.0. Initially I solved it by deleting the dfs.data.dir folders on the problematic datanodes and reformatting the namenode.
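For what it's worth, the VERSION edit described earlier in this thread could be scripted roughly as below. This is only a sketch under assumptions: the helper name sync_namespace_id is made up for illustration, both VERSION files are assumed to contain a plain namespaceID=<number> line, and the datanode is assumed to be stopped before its file is touched. The paths are supplied by the caller and are not taken from any particular install.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): copy the namespaceID from the
# namenode's VERSION file into a datanode's VERSION file.
#   $1 = namenode VERSION file, e.g. <dfs.name.dir>/current/VERSION
#   $2 = datanode VERSION file, e.g. <dfs.data.dir>/current/VERSION
sync_namespace_id() {
    nn_version="$1"
    dn_version="$2"

    # Read the ID from the namenode's "namespaceID=<number>" line.
    nn_id=$(sed -n 's/^namespaceID=//p' "$nn_version")
    if [ -z "$nn_id" ]; then
        echo "no namespaceID found in $nn_version" >&2
        return 1
    fi

    # Keep a backup, then rewrite only the namespaceID line;
    # every other line of the datanode file is left untouched.
    cp "$dn_version" "$dn_version.bak" || return 1
    sed "s/^namespaceID=.*/namespaceID=$nn_id/" "$dn_version.bak" > "$dn_version"
}
```

It would be called once per datanode, e.g. sync_namespace_id /path/to/namenode/VERSION /path/to/datanode/VERSION, with the namenode's file first copied somewhere the datanode can read it. Whether this is safe after an upgrade is not something this sketch can guarantee, hence the .bak copy.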
I saw this issue on our small 6-node cluster too. It took a while to identify the root cause of the problem. The symptoms were the same as described here. In our case we have both 18 and 20 installed on our cluster, but we only run 20. A user saw the HDFS exception for their job, so they stopped 20, decided to go back to 18, and tried to start it. Then they switched back to 20 again. In doing all this, the version files of the datanodes and namenode got messed up, and the DNs and NN ended up with different sets of information in their version files. Apart from this peculiar use case, as things currently stand in HDFS, I think even one small misstep in upgrading the cluster can result in this bug, as reported in previous comments. I think that at cluster startup the namenode and datanodes should also exchange the information contained in the version file and, in case of a mismatch, reconcile the differences, potentially asking for user input when the choices are not safe to make automatically.
A few workarounds are suggested in the previous comments. Which one of these is the recommended one?
Thanks.
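Until something like the startup check proposed above exists, a manual comparison before starting a datanode might look like the sketch below. The function name namespace_ids_match is hypothetical, and it assumes each VERSION file carries a namespaceID=<number> line and that both files are readable from where the check runs. It only reports a mismatch and does not try to reconcile anything.

```shell
#!/bin/sh
# Hypothetical pre-start check (not part of Hadoop).
#   $1 = namenode VERSION file, $2 = datanode VERSION file
# Returns 0 if the namespaceIDs match, 1 on mismatch, 2 if either is missing.
namespace_ids_match() {
    nn_id=$(sed -n 's/^namespaceID=//p' "$1")
    dn_id=$(sed -n 's/^namespaceID=//p' "$2")

    if [ -z "$nn_id" ] || [ -z "$dn_id" ]; then
        echo "namespaceID missing in $1 or $2" >&2
        return 2
    fi
    if [ "$nn_id" != "$dn_id" ]; then
        echo "namespaceID mismatch: namenode=$nn_id datanode=$dn_id" >&2
        return 1
    fi
    return 0
}
```

A non-zero return could be used to refuse to start the datanode and leave the decision to the operator, which is roughly the "ask the user" behaviour suggested above.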