Description
We migrated our NameNodes from low-spec to higher-spec machines last week. First, we copied the current directory, including the fsimage and edit log files, from the original Active NameNode to the new Active NameNode and started the new NameNode. We then changed the configuration of all DataNodes and restarted them; they immediately sent block reports to the new NameNodes and sent heartbeats after that.
Everything seemed fine, but after we restarted the ResourceManager, most users complained that their jobs could not run because of permission problems.
We use ACLs in our clusters, and after the migration we found that most directories and files that had never had ACLs set now carried ACLs. That is why users could not run their jobs. We had to change the permissions of most files to a+r and of most directories to a+rx so that jobs could run.
After investigating this problem for several days, I found a bug in FSEditLog.java. The op obtained from the ThreadLocal variable cache in FSEditLog is not given the proper value in the logMkDir and logOpenFile functions. Here is the code of logMkDir:
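public void logMkDir(String path, INode newNode) {
  PermissionStatus permissions = newNode.getPermissionStatus();
  // The op comes from the per-thread cache, so it is reused across calls.
  MkdirOp op = MkdirOp.getInstance(cache.get())
    .setInodeId(newNode.getId())
    .setPath(path)
    .setTimestamp(newNode.getModificationTime())
    .setPermissionStatus(permissions);
  AclFeature f = newNode.getAclFeature();
  // ACL entries are only written when the node has an AclFeature;
  // otherwise the entries left by a previous call stay on the cached op.
  if (f != null) {
    op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
  }
  logEdit(op);
}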
For example, suppose a mkdir with ACLs goes through one handler (a thread, in effect): the AclEntries are set on the op taken from the cache. If a later mkdir without any ACLs goes through the same handler, the AclEntries on the cached op are still those of the previous call, and because the new node has no AclFeature, nothing overwrites them. The edit log therefore records the wrong ACLs. Once the Standby NameNode (SNN) loads the edit logs from the JournalNodes, applies them in memory, saves the namespace, and transfers the resulting fsimage to the Active NameNode (ANN), every fsimage is wrong. The only fix is to save the namespace from the ANN, which produces a correct fsimage.
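The stale-state behaviour described above can be reproduced outside Hadoop. The following is a minimal sketch, not the actual FSEditLog code: EditLogCacheSketch, MkdirOpSketch, logMkDirBuggy and logMkDirReset are hypothetical names, and ACL entries are modelled as plain strings. It assumes only that a mutable op is cached per thread and that the ACL field is written only when an ACL is present. The "reset" variant is one possible way to avoid the problem and is not necessarily how it was fixed upstream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EditLogCacheSketch {

    /** Hypothetical stand-in for MkdirOp: a mutable op reused by each thread. */
    static final class MkdirOpSketch {
        String path;
        List<String> aclEntries; // null means "no ACL recorded for this op"

        MkdirOpSketch setPath(String path) { this.path = path; return this; }
        MkdirOpSketch setAclEntries(List<String> entries) { this.aclEntries = entries; return this; }
    }

    // One cached op per handler thread, playing the role of the thread-local cache.
    private static final ThreadLocal<MkdirOpSketch> CACHE =
        ThreadLocal.withInitial(MkdirOpSketch::new);

    /** Buggy variant: the ACL field is only written when the new node has an ACL. */
    static List<String> logMkDirBuggy(String path, List<String> acl) {
        MkdirOpSketch op = CACHE.get().setPath(path);
        if (acl != null) {
            op.setAclEntries(acl);
        }
        return snapshot(op); // what would be written to the edit log
    }

    /** Variant that clears the ACL field before every use of the cached op. */
    static List<String> logMkDirReset(String path, List<String> acl) {
        MkdirOpSketch op = CACHE.get().setPath(path).setAclEntries(null);
        if (acl != null) {
            op.setAclEntries(acl);
        }
        return snapshot(op);
    }

    // Copy the ACL entries so the result does not alias the cached op.
    private static List<String> snapshot(MkdirOpSketch op) {
        return op.aclEntries == null ? null : new ArrayList<>(op.aclEntries);
    }

    public static void main(String[] args) {
        List<String> acl = Arrays.asList("user:alice:rwx");

        logMkDirBuggy("/with-acl", acl);
        // Same thread, no ACL requested, yet the previous entries are logged.
        System.out.println("buggy: " + logMkDirBuggy("/no-acl", null));

        logMkDirReset("/with-acl", acl);
        System.out.println("reset: " + logMkDirReset("/no-acl", null));
    }
}
```

Run on a single thread, the buggy variant logs the ACL of the previous mkdir for the directory that has none, while the reset variant logs no ACL.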
Issue Links
- is related to HDFS-7398 Reset cached thread-local FSEditLogOp's on every FSEditLog#logEdit (Closed)