  Spark / SPARK-20328

HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.1.0, 2.1.1, 2.1.2
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      In order to obtain InputSplit information, HadoopRDD creates a MapReduce JobConf out of the Hadoop Configuration: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
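
      Roughly, that code path looks like the following (a simplified sketch rather than the actual HadoopRDD code; the input path and split count are placeholders, and the real implementation adds configuration caching, credential handling, and broadcasting):

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
      import org.apache.hadoop.util.ReflectionUtils

      val hadoopConf = new Configuration()
      // The plain Hadoop Configuration is wrapped in a MapReduce JobConf, because the
      // old-API InputFormat.getSplits signature only accepts a JobConf.
      val jobConf = new JobConf(hadoopConf)
      FileInputFormat.setInputPaths(jobConf, "hdfs://namenode/some/input")  // placeholder path

      // ReflectionUtils.newInstance calls configure(jobConf) on JobConfigurable input
      // formats, as HadoopRDD.getInputFormat does.
      val inputFormat = ReflectionUtils.newInstance(classOf[TextInputFormat], jobConf)

      // getSplits -> FileInputFormat.listStatus -> TokenCache.obtainTokensForNamenodes,
      // which is where the MapReduce-specific security code described below gets pulled in.
      val splits = inputFormat.getSplits(jobConf, 2)  // placeholder split hint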

      Semantically, this is a problem because a HadoopRDD does not represent a Hadoop MapReduce job. Practically, this is a problem because this line: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194 results in this MapReduce-specific security code being called: https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130, which assumes the MapReduce master is configured (e.g. via yarn.resourcemanager.*). If it isn't, an exception is thrown.

      I'm seeing this exception thrown while trying to add Kerberos support for the Spark Mesos scheduler:

      Exception in thread "main" java.io.IOException: Can't get Master Kerberos principal for use as renewer
      	at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
      	at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
      	at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
      	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
      	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
      	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
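
      The check that throws is roughly the following (a paraphrased sketch of the Hadoop behavior as I understand it, not the actual TokenCache/Master source; checkRenewerPrincipal is a made-up helper name):

      import org.apache.hadoop.conf.Configuration

      def checkRenewerPrincipal(conf: Configuration): Unit = {
        // TokenCache needs a "master" Kerberos principal to use as the delegation-token
        // renewer. Unless mapreduce.framework.name selects the classic framework, that
        // principal is read from yarn.resourcemanager.principal, which is unset when no
        // ResourceManager is configured.
        val renewer = conf.get("yarn.resourcemanager.principal")
        if (renewer == null || renewer.isEmpty)
          throw new java.io.IOException("Can't get Master Kerberos principal for use as renewer")
      }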
      

      I have a workaround where I set a YARN-specific configuration variable to trick TokenCache into thinking YARN is configured, but this is obviously suboptimal.
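
      Concretely, the workaround is something like the following (a sketch rather than my exact configuration; the app name, principal, and path are placeholders):

      import org.apache.spark.{SparkConf, SparkContext}

      // "spark.hadoop.*" properties are copied into the Hadoop Configuration that
      // HadoopRDD turns into a JobConf, so TokenCache finds a "master" principal to use
      // as the delegation-token renewer even though no YARN ResourceManager is involved.
      val conf = new SparkConf()
        .setAppName("tokencache-workaround-sketch")  // placeholder app name
        .set("spark.hadoop.yarn.resourcemanager.principal", "yarn/_HOST@EXAMPLE.COM")  // placeholder principal

      val sc = new SparkContext(conf)
      val rdd = sc.textFile("hdfs://namenode/secure/data")  // placeholder secure path

      The same property can also be passed with spark-submit via --conf spark.hadoop.yarn.resourcemanager.principal=..., but either way it is a hack rather than a fix.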

      The proper fix would likely require significant Hadoop refactoring to make split information available without going through JobConf, so I'm not yet sure what the best course of action is.


            People

              Assignee: Unassigned
              Reporter: Michael Gummelt (mgummelt)
              Votes: 0
              Watchers: 5
