Description
The visibility of distributed cache files and archives is currently determined by the permissions of those files or archives.
The following is the logic of method isPublic() in class ClientDistributedCacheManager:
static boolean isPublic(Configuration conf, URI uri,
    Map<URI, FileStatus> statCache) throws IOException {
  FileSystem fs = FileSystem.get(uri, conf);
  Path current = new Path(uri.getPath());
  // the leaf level file should be readable by others
  if (!checkPermissionOfOther(fs, current, FsAction.READ, statCache)) {
    return false;
  }
  return ancestorsHaveExecutePermissions(fs, current.getParent(), statCache);
}
On the NodeManager side, public files are downloaded as the "yarn" user, and private files are downloaded as the user who submitted the job. Normally this causes no problem. However, if the files are located in an encryption zone (HDFS-6134) and KMS is configured to disallow the yarn user from fetching the DataEncryptionKey (DEK) of that encryption zone, the download of these files will fail.
You can reproduce this issue with the following steps (assuming you submit the jobs as user "testUser"):
- create a clean cluster with the HDFS cryptographic FileSystem feature enabled
- create the directory "/data/" in HDFS and make it an encryption zone with keyName "testKey"
- configure KMS so that only user "testUser" is allowed to decrypt the DEK of key "testKey":
<property>
  <name>key.acl.testKey.DECRYPT_EEK</name>
  <value>testUser</value>
</property>
- execute job "teragen" with user "testUser":
su -s /bin/bash testUser -c "hadoop jar hadoop-mapreduce-examples*.jar teragen 10000 /data/terasort-input"
- execute job "terasort" with user "testUser":
su -s /bin/bash testUser -c "hadoop jar hadoop-mapreduce-examples*.jar terasort /data/terasort-input /data/terasort-output"
You will see logs like this at the job submitter's console:
INFO mapreduce.Job: Job job_1416860917658_0002 failed with state FAILED due to: Application application_1416860917658_0002 failed 2 times due to AM Container for appattempt_1416860917658_0002_000002 exited with exitCode: -1000 due to: org.apache.hadoop.security.authorize.AuthorizationException: User [yarn] is not authorized to perform [DECRYPT_EEK] on key with ACL name [testKey]!!
The initial idea for solving this issue is to modify the logic in ClientDistributedCacheManager.isPublic so that it also considers whether the file is in an encryption zone. If it is, the file should be treated as private. The NodeManager will then fetch the file as the user who submitted the job.
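The proposed decision can be sketched as a small self-contained model. This is only an illustration of the idea, not the actual patch: the class VisibilitySketch, the Entry type, the in-memory map standing in for the file system, and the set of encryption zone roots are all hypothetical. In real Hadoop code the encryption check would have to come from HDFS itself (for example, something along the lines of FileStatus.isEncrypted(), if the target version provides it). The path "/data/terasort-output/_partition.lst" is used purely as an example of a cache file inside the zone.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class VisibilitySketch {

    /** Hypothetical stand-in for the "other" permission bits of a FileStatus. */
    static final class Entry {
        final boolean otherRead;
        final boolean otherExecute;

        Entry(boolean otherRead, boolean otherExecute) {
            this.otherRead = otherRead;
            this.otherExecute = otherExecute;
        }
    }

    /** Returns the parent of an absolute path, or null for the root. */
    static String parent(String path) {
        if (path.equals("/")) {
            return null;
        }
        int idx = path.lastIndexOf('/');
        return idx == 0 ? "/" : path.substring(0, idx);
    }

    /** True if the path is an encryption zone root or lies below one (assumed lookup). */
    static boolean inEncryptionZone(String path, Set<String> zoneRoots) {
        for (String root : zoneRoots) {
            String prefix = root.endsWith("/") ? root : root + "/";
            if (path.equals(root) || path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** Mirrors the existing rule: every ancestor must be executable by others. */
    static boolean ancestorsHaveExecute(String path, Map<String, Entry> fs) {
        for (String dir = parent(path); dir != null; dir = parent(dir)) {
            Entry e = fs.get(dir);
            if (e == null || !e.otherExecute) {
                return false;
            }
        }
        return true;
    }

    /** Proposed rule: a file in an encryption zone is never public. */
    static boolean isPublic(String path, Map<String, Entry> fs, Set<String> zoneRoots) {
        if (inEncryptionZone(path, zoneRoots)) {
            return false; // treated as private, so the submitting user fetches it
        }
        Entry leaf = fs.get(path);
        if (leaf == null || !leaf.otherRead) {
            return false;
        }
        return ancestorsHaveExecute(path, fs);
    }

    public static void main(String[] args) {
        Map<String, Entry> fs = new HashMap<>();
        fs.put("/", new Entry(true, true));
        fs.put("/data", new Entry(true, true));
        fs.put("/data/terasort-output", new Entry(true, true));
        fs.put("/data/terasort-output/_partition.lst", new Entry(true, false));
        fs.put("/tmp", new Entry(true, true));
        fs.put("/tmp/lib.jar", new Entry(true, false));

        Set<String> zones = Collections.singleton("/data");

        System.out.println(isPublic("/data/terasort-output/_partition.lst", fs, zones));
        System.out.println(isPublic("/tmp/lib.jar", fs, zones));
    }
}
```

With world-readable permissions everywhere, the file under "/data" is still classified as private because of the zone check, while the file under "/tmp" stays public. Putting the zone check first means no permission lookups are needed for encrypted files, but the order is a design choice of this sketch rather than a requirement.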