Description
I populated a SparkConf object, passed to a SparkContext, with some spark.hadoop.* configurations, expecting them to be used by the backing Hadoop file reading whenever I read from my DFS. However, when running some jobs, I noticed that these configurations were not applied when reading DataFrames via sqlContext.read().parquet().
I looked in the codebase and noticed that SqlNewHadoopRDD uses neither the SparkConf nor the SparkContext's Hadoop configuration to set up the Hadoop reading; instead, it uses SparkHadoopUtil.get.conf. That Hadoop configuration object won't have the Hadoop configurations set on the SparkContext. In general there is a discrepancy in how we apply Hadoop configurations: when reading raw RDDs via e.g. SparkContext.textFile() we take the Hadoop configuration from the SparkContext, but for DataFrames we use SparkHadoopUtil.conf.
We should probably use the SparkContext's Hadoop configuration for DataFrames as well.
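For context, spark.hadoop.*-prefixed entries on a SparkConf are expected to be copied into the Hadoop Configuration that the SparkContext exposes, with the prefix stripped. A minimal sketch of that propagation rule, written in plain Python standing in for the actual Scala implementation (the keys and values below are hypothetical):

```python
# Sketch: how spark.hadoop.* entries on a SparkConf-like mapping are
# expected to reach the Hadoop configuration. The "spark.hadoop." prefix
# is stripped and the remainder becomes a Hadoop key. Illustrative only;
# the real logic lives in Spark's Scala codebase.
PREFIX = "spark.hadoop."

def hadoop_entries(spark_conf: dict) -> dict:
    """Return the Hadoop-level settings implied by a SparkConf-like dict."""
    return {
        key[len(PREFIX):]: value
        for key, value in spark_conf.items()
        if key.startswith(PREFIX)
    }

# Hypothetical settings for illustration.
conf = {
    "spark.app.name": "example",
    "spark.hadoop.fs.defaultFS": "hdfs://nn:8020",
    "spark.hadoop.dfs.replication": "2",
}
print(hadoop_entries(conf))
```

The bug reported here is that the DataFrame read path (SqlNewHadoopRDD via SparkHadoopUtil.get.conf) does not see the result of this propagation, while the raw-RDD path does.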
Issue Links
- duplicates: SPARK-14912 Propagate data source options to Hadoop configurations (Resolved)
- is related to: SPARK-14913 Simplify configuration API (Resolved)