  Spark / SPARK-20328

HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.1.0, 2.1.1, 2.1.2
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      In order to obtain InputSplit information, HadoopRDD creates a MapReduce JobConf out of the Hadoop Configuration: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
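
      Roughly, that code path looks like the following (a simplified sketch rather than the actual HadoopRDD code; the input path and split count are placeholders, and the real implementation adds configuration caching, credential handling, and broadcasting):

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
      import org.apache.hadoop.util.ReflectionUtils

      val hadoopConf = new Configuration()
      // The plain Hadoop Configuration is wrapped in a MapReduce JobConf, because the
      // old-API InputFormat.getSplits signature only accepts a JobConf.
      val jobConf = new JobConf(hadoopConf)
      FileInputFormat.setInputPaths(jobConf, "hdfs://namenode/some/input")  // placeholder path

      // ReflectionUtils.newInstance calls configure(jobConf) on JobConfigurable input
      // formats, as HadoopRDD.getInputFormat does.
      val inputFormat = ReflectionUtils.newInstance(classOf[TextInputFormat], jobConf)

      // getSplits -> FileInputFormat.listStatus -> TokenCache.obtainTokensForNamenodes,
      // which is where the MapReduce-specific security code described below gets pulled in.
      val splits = inputFormat.getSplits(jobConf, 2)  // placeholder split hint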

      Semantically, this is a problem because a HadoopRDD does not represent a Hadoop MapReduce job. Practically, this is a problem because this line: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194 results in this MapReduce-specific security code being called: https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130, which assumes the MapReduce master is configured (e.g. via yarn.resourcemanager.*). If it isn't, an exception is thrown.

      I'm seeing this exception thrown while trying to add Kerberos support for the Spark Mesos scheduler:

      Exception in thread "main" java.io.IOException: Can't get Master Kerberos principal for use as renewer
      	at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
      	at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
      	at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
      	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
      	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
      	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
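
      The check that throws is roughly the following (a paraphrased sketch of the Hadoop behavior as I understand it, not the actual TokenCache/Master source; checkRenewerPrincipal is a made-up helper name):

      import org.apache.hadoop.conf.Configuration

      def checkRenewerPrincipal(conf: Configuration): Unit = {
        // TokenCache needs a "master" Kerberos principal to use as the delegation-token
        // renewer. Unless mapreduce.framework.name selects the classic framework, that
        // principal is read from yarn.resourcemanager.principal, which is unset when no
        // ResourceManager is configured.
        val renewer = conf.get("yarn.resourcemanager.principal")
        if (renewer == null || renewer.isEmpty)
          throw new java.io.IOException("Can't get Master Kerberos principal for use as renewer")
      }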
      

      I have a workaround where I set a YARN-specific configuration variable to trick TokenCache into thinking YARN is configured, but this is obviously suboptimal.
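
      Concretely, the workaround is something like the following (a sketch rather than my exact configuration; the app name, principal, and path are placeholders):

      import org.apache.spark.{SparkConf, SparkContext}

      // "spark.hadoop.*" properties are copied into the Hadoop Configuration that
      // HadoopRDD turns into a JobConf, so TokenCache finds a "master" principal to use
      // as the delegation-token renewer even though no YARN ResourceManager is involved.
      val conf = new SparkConf()
        .setAppName("tokencache-workaround-sketch")  // placeholder app name
        .set("spark.hadoop.yarn.resourcemanager.principal", "yarn/_HOST@EXAMPLE.COM")  // placeholder principal

      val sc = new SparkContext(conf)
      val rdd = sc.textFile("hdfs://namenode/secure/data")  // placeholder secure path

      The same property can also be passed with spark-submit via --conf spark.hadoop.yarn.resourcemanager.principal=..., but either way it is a hack rather than a fix.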

      The proper fix would likely require significant Hadoop refactoring to make split information available without going through JobConf, so I'm not yet sure what the best course of action is.


            People

              Assignee: Unassigned
              Reporter: Michael Gummelt (mgummelt)
              Votes: 0
              Watchers: 5
