
Key: HDFS-107
Type: Bug
Status: Open
Priority: Major
Assignee: Unassigned
Reporter: Konstantin Shvachko
Votes: 8
Watchers: 9
Hadoop HDFS

Data-nodes should be formatted when the name-node is formatted.

Created: 05/Apr/07 08:49 PM   Updated: 20/Oct/09 10:26 PM
Component/s: None
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified


 Description
The upgrade feature HADOOP-702 requires data-nodes to persistently store the namespaceID
in their version files and to verify during startup that it matches the one stored on the name-node.
When the name-node is reformatted it generates a new namespaceID.
If the cluster then starts with the reformatted name-node and data-nodes that were not reformatted,
the data-nodes will fail with
java.io.IOException: Incompatible namespaceIDs ...
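
The startup check described above can be sketched roughly as follows. This is an illustration in Python, not the actual HDFS code: the function names `parse_version` and `check_namespace_id` are hypothetical, and the wording of the error message after "Incompatible namespaceIDs" is made up for the example.

```python
def parse_version(text):
    """Parse the key=value lines of a VERSION file, skipping '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_namespace_id(datanode_version_text, namenode_namespace_id):
    """Raise IOError if the data-node's stored namespaceID differs from the name-node's."""
    stored = int(parse_version(datanode_version_text)["namespaceID"])
    if stored != namenode_namespace_id:
        # Illustrative message only; the real text is elided in the report above.
        raise IOError(
            "Incompatible namespaceIDs: name-node has %d, data-node has %d"
            % (namenode_namespace_id, stored)
        )
    return stored
```

Because a reformat gives the name-node a fresh namespaceID, any data-node still holding the old value would fail this kind of comparison.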

Data-nodes should be reformatted whenever the name-node is. I see 2 approaches here:
1) In order to reformat the cluster we call "start-dfs -format" or make a special script "format-dfs".
This would format the cluster components all together. The question is whether it should start
the cluster after formatting?
2) Format the name-node only. When data-nodes connect to the name-node it will tell them to
format their storage directories if it sees that the namespace is empty and its cTime=0.
The drawback of this approach is that we can lose blocks of a data-node from another cluster
if it connects by mistake to the empty name-node.
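
A minimal sketch of the decision in approach 2, written in Python for illustration only. The names `NameNodeState`, `file_count`, and `registration_action` are assumptions introduced here, and "namespace is empty" is modeled simply as `file_count == 0`.

```python
from dataclasses import dataclass


@dataclass
class NameNodeState:
    namespace_id: int
    c_time: int
    file_count: int  # hypothetical stand-in for "the namespace is empty"


def registration_action(nn, dn_namespace_id):
    """Return 'accept', 'format', or 'reject' for a connecting data-node."""
    if dn_namespace_id == nn.namespace_id:
        return "accept"
    # Approach 2: a freshly formatted name-node tells the data-node to format.
    if nn.file_count == 0 and nn.c_time == 0:
        return "format"
    return "reject"
```

The drawback is visible in the second branch: the name-node cannot tell a stale data-node of its own cluster from one that belongs to a different cluster, so both would be told to format.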



 Comments
Stu Hood added a comment - 25/Jul/07 07:32 PM - edited
Does anyone have any thoughts on this issue? I've been getting "Incompatible namespaceID" errors on my datanodes after formatting with `bin/hadoop namenode -format`. My current solution is to remove the hadoop-*-data directory on each datanode, but there ought to be a better way.

Thanks.


Jared Stehler added a comment - 05/Aug/08 06:06 PM
I have a more elegant work-around which doesn't involve deleting the data folders: edit the <hadoop-data-root>/dfs/data/current/VERSION file, changing the namespaceID to match the current namenode:

[jstehler@server19 ~]$ cat /lv_main/hadoop/dfs/data/current/VERSION
#Fri Aug 01 18:40:43 UTC 2008
namespaceID=292609117
storageID=DS-1525930547-66.135.42.149-50010-1217002151282
cTime=0
storageType=DATA_NODE
layoutVersion=-11

This allowed me to bring up the slave datanode and have it recognized by the namenode in the DFS UI.
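
The manual edit could be scripted along these lines. This is a hedged Python sketch: `set_namespace_id` is a hypothetical helper, and it assumes the datanode is stopped and the VERSION file is backed up before anything is written back.

```python
import re


def set_namespace_id(version_text, new_id):
    """Return VERSION contents with only the namespaceID line replaced."""
    updated, count = re.subn(
        r"(?m)^namespaceID=.*$", "namespaceID=%d" % new_id, version_text
    )
    if count != 1:
        raise ValueError("expected exactly one namespaceID line, found %d" % count)
    return updated
```

Only the namespaceID line changes, so the node's own storageID and the other fields stay as they were. The new value would be read from the namenode's own VERSION file.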


Roman Valls added a comment - 25/Nov/08 01:11 PM
I can confirm the bug, upgrading from 0.17 to 0.18.1 did not work. I even deleted the HDFS on nodes and formatted (I'm running a small 9-node cluster):

$ bin/hadoop/stop-all && cluster-fork rm -rf /state/partition1/hdfs/hadoop/*
$ hadoop namenode -format

In addition, I tried a rough variant on Jared's solution that did not work either:

$ cp /state/partition1/hdfs/hadoop/dfs/data/current/VERSION /shared/apps/VERSION
$ cluster-fork cp -a /shared/apps/VERSION /state1/partition1/hdfs/hadoop/dfs/data/current/VERSION

Is there a reliable way to make it work right away? Can this VERSION file (or namespaceID) be forced to be equal on every node?


Andrii Vozniuk added a comment - 09/Feb/09 08:11 AM - edited
I have had the same problem with version 0.19.0. Initially I solved it by deleting the dfs.data.dir folders on the problematic datanodes and reformatting the namenode.

Ashutosh Chauhan added a comment - 20/Oct/09 10:26 PM
I saw this issue on our small 6-node cluster too. It took a while to identify the root cause of the problem. The symptoms were the same as described here. In our case we have both 18 and 20 installed in our cluster, but we only run 20. A user saw the HDFS exception for their job, so they stopped 20, tried to go back to 18 and start it, and then switched back to 20 again. In the process, the version files of the datanodes and the namenode got out of sync, so the DNs and the NN held different information in their version files. Apart from this peculiar use case, as things currently stand in HDFS, I think even one small misstep in upgrading the cluster can result in this bug, as reported in previous comments. I think that at cluster startup the namenode and datanodes should also exchange the information contained in their version files and, in case of a mismatch, reconcile the differences, potentially asking for the user's input when the choices are not safe to make automatically.
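
As a rough illustration of the comparison step in this proposal (Python, hypothetical names throughout; the list of fields that ought to agree between the two sides is an assumption, not something taken from the HDFS source):

```python
# Assumed set of fields that both version files are expected to share.
SHARED_FIELDS = ("namespaceID", "cTime", "layoutVersion")


def version_mismatches(namenode_props, datanode_props, fields=SHARED_FIELDS):
    """Map each differing field to its (name-node value, data-node value) pair."""
    diffs = {}
    for field in fields:
        nn_value = namenode_props.get(field)
        dn_value = datanode_props.get(field)
        if nn_value != dn_value:
            diffs[field] = (nn_value, dn_value)
    return diffs
```

An empty result would mean the node can join as-is. A non-empty result is the list of differences that would either be reconciled automatically or shown to the user for a decision.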

A few workarounds are suggested in previous comments. Which one of these is recommended?