When we upgraded a medium-sized cluster from 2.5 to 2.6, about 4% of the datanodes did not come up. They attempted the data file layout upgrade for BLOCKID_BASED_LAYOUT, introduced in HDFS-6482, but it failed.
All failures were caused by NativeIO.link() throwing an IOException reporting EEXIST. The datanodes did not die right away. Instead, the upgrade was retried each time block pool initialization was retried, which happened whenever BPServiceActor tried to register with the namenode. After many retries, the datanodes terminated. This left the block pool slice storage directory with a previous.tmp and a current that had no VERSION file.
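To illustrate the failure mode, here is a minimal sketch of how creating a hard link fails when the destination name already exists. It uses the standard java.nio.file.Files.createLink() rather than Hadoop's NativeIO.link(), so it only approximates the datanode code path. The class name, the helper tryLink(), and the block file names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration; not the datanode's actual upgrade code.
class LinkEexistSketch {

    /**
     * Tries to hard-link src to dst.
     * Returns true if the link was created, false if dst already existed
     * (the java.nio equivalent of link() failing with EEXIST).
     */
    static boolean tryLink(Path src, Path dst) throws IOException {
        try {
            // Note the argument order: the new link first, the existing file second.
            Files.createLink(dst, src);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("link-sketch");
        Path src = Files.createFile(dir.resolve("blk_1001"));   // made-up block file name
        Path dst = dir.resolve("blk_1001.linked");

        System.out.println("first attempt linked:  " + tryLink(src, dst));
        System.out.println("second attempt linked: " + tryLink(src, dst));
    }
}
```

The second attempt fails because the destination is already present. That is the same condition a retried upgrade would hit if the links created by an earlier attempt were still in place.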
Although previous.tmp contained the old VERSION file, its content was in the new layout and the subdirs were all newly created. This should not have happened, because the upgrade-recovery logic in Storage removes current and renames previous.tmp to current before retrying. All successfully upgraded volumes had their old state preserved in their previous directory.
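The recovery step described above (remove current, then rename previous.tmp to current) can be sketched as follows. This is a simplified illustration written with java.nio, not the actual code in Storage. The class name and the methods recoverUpgrade() and deleteRecursively() are hypothetical, and the sketch assumes recovery is triggered simply by the presence of previous.tmp.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration of the recovery sequence; not Hadoop's Storage class.
class UpgradeRecoverySketch {

    /** Deletes dir and everything under it; does nothing if dir is absent. */
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    /**
     * If previous.tmp exists, discards the half-upgraded current and restores
     * previous.tmp as current. Returns true if a recovery was performed.
     */
    static boolean recoverUpgrade(Path storageDir) throws IOException {
        Path current = storageDir.resolve("current");
        Path previousTmp = storageDir.resolve("previous.tmp");
        if (!Files.isDirectory(previousTmp)) {
            return false;                    // nothing to recover
        }
        deleteRecursively(current);          // drop the partial upgrade
        Files.move(previousTmp, current);    // restore the pre-upgrade state
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path sd = Files.createTempDirectory("storage-sketch");
        Files.createDirectories(sd.resolve("current/subdir0"));
        Files.createDirectories(sd.resolve("previous.tmp"));
        Files.createFile(sd.resolve("previous.tmp/VERSION"));

        System.out.println("recovered: " + recoverUpgrade(sd));
        System.out.println("current/VERSION exists: "
            + Files.exists(sd.resolve("current/VERSION")));
    }
}
```

If recovery runs this way before every retry, current should always start from the original pre-upgrade content. That is why finding new-layout content inside previous.tmp was unexpected.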
In summary, we observed two issues.
- The upgrade failed because link() failed with EEXIST.
- previous.tmp contained a half-upgraded layout rather than the original content of current.
We did not see this in smaller-scale test clusters.