Description
I populated a SparkConf object, passed to a SparkContext, with some spark.hadoop.* configurations, expecting them to be used by the backing Hadoop file reading whenever I read from my DFS. However, when running some jobs, I noticed that these configurations were not applied when reading DataFrames via sqlContext.read().parquet().
I looked in the codebase and noticed that SqlNewHadoopRDD uses neither the SparkConf nor the SparkContext's Hadoop configuration to set up the Hadoop reading; instead, it uses SparkHadoopUtil.get.conf. That Hadoop configuration object won't have the Hadoop configurations set on the SparkContext. In general there is a discrepancy in how we apply Hadoop configurations: when reading raw RDDs via e.g. SparkContext.textFile() we take the Hadoop configuration from the SparkContext, but for DataFrames we use SparkHadoopUtil.conf.
We should probably use the SparkContext's Hadoop configuration for DataFrames as well.
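For context, spark.hadoop.*-prefixed entries on a SparkConf are expected to be copied into the Hadoop Configuration that the SparkContext exposes, with the prefix stripped. A minimal sketch of that propagation rule, written in plain Python standing in for the actual Scala implementation (the keys and values below are hypothetical):

```python
# Sketch: how spark.hadoop.* entries on a SparkConf-like mapping are
# expected to reach the Hadoop configuration. The "spark.hadoop." prefix
# is stripped and the remainder becomes a Hadoop key. Illustrative only;
# the real logic lives in Spark's Scala codebase.
PREFIX = "spark.hadoop."

def hadoop_entries(spark_conf: dict) -> dict:
    """Return the Hadoop-level settings implied by a SparkConf-like dict."""
    return {
        key[len(PREFIX):]: value
        for key, value in spark_conf.items()
        if key.startswith(PREFIX)
    }

# Hypothetical settings for illustration.
conf = {
    "spark.app.name": "example",
    "spark.hadoop.fs.defaultFS": "hdfs://nn:8020",
    "spark.hadoop.dfs.replication": "2",
}
print(hadoop_entries(conf))
```

The bug reported here is that the DataFrame read path (SqlNewHadoopRDD via SparkHadoopUtil.get.conf) does not see the result of this propagation, while the raw-RDD path does.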
Issue Links
- duplicates: SPARK-14912 Propagate data source options to Hadoop configurations (Resolved)
- is related to: SPARK-14913 Simplify configuration API (Resolved)