[SPARK-5655] YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.2.1
Fix Version/s: 1.2.2, 1.3.0
Component/s: YARN
Labels:
- hadoop
Environment:

Both CDH5.3.0 and CDH5.1.3, latest build on branch-1.2

Target Version/s:

1.2.2, 1.3.0

Description

When running a Spark job on a YARN cluster which doesn't run containers under the same user as the nodemanager, and also when using the YARN auxiliary shuffle service, jobs fail with something similar to:

java.io.FileNotFoundException: /data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index (Permission denied)

The root cause of this here: https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287

Spark will attempt to chmod 700 any application directories it creates during the job, which includes files created in the nodemanager's usercache directory. The owner of these files is the container UID, which on a secure cluster is the name of the user creating the job, and on an nonsecure cluster but with the yarn.nodemanager.container-executor.class configured is the value of yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user.

The problem with this is that the auxiliary shuffle manager runs as part of the nodemanager, which is typically running as the user 'yarn'. This can't access these files that are only owner-readable.

YARN already attempts to secure files created under appcache but keep them readable by the nodemanager, by setting the group of the appcache directory to 'yarn' and also setting the setgid flag. This means that files and directories created under this should also have the 'yarn' group. Normally this means that the nodemanager should also be able to read these files, but Spark setting chmod700 wipes this out.

I'm not sure what the right approach is here. Commenting out the chmod700 functionality makes this work on YARN, and still makes the application files only readable by the owner and the group:

/data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c # ls -lah
total 206M
drwxr-s---  2 nobody yarn 4.0K Feb  6 18:30 .
drwxr-s--- 12 nobody yarn 4.0K Feb  6 18:30 ..
-rw-r-----  1 nobody yarn 206M Feb  6 18:30 shuffle_0_0_0.data

But this may not be the right approach on non-YARN. Perhaps an additional step to see if this chmod700 step is necessary (ie non-YARN) is required. Sadly, I don't have a non-YARN environment to test, otherwise I'd be able to suggest a patch.

I believe this is a related issue in the MapReduce framwork: https://issues.apache.org/jira/browse/MAPREDUCE-3728

Attachments

Issue Links

links to

[Github] Pull Request #4509 (growse)

Activity

People

Assignee:: Andrew Rowson

Reporter:: Andrew Rowson

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 06/Feb/15 18:49

Updated:: 12/Feb/15 19:13

Resolved:: 12/Feb/15 18:45