SPARK-13912

spark.hadoop.* configurations are not applied for Parquet Data Frame Readers


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.6.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      I populated a SparkConf with some spark.hadoop.* configurations and passed it to a SparkContext, expecting them to be used by the underlying Hadoop file reads whenever I read from my DFS. However, when running jobs I noticed that those configurations were not applied when I read DataFrames via sqlContext.read().parquet().
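
      A minimal reproduction sketch (Scala, Spark 1.6 API; the configuration key and paths below are hypothetical placeholders):

      {code:scala}
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext

      // A spark.hadoop.* key set programmatically on the SparkConf.
      // "my.custom.key" is an illustrative placeholder, not a real setting.
      val conf = new SparkConf()
        .setAppName("spark-hadoop-conf-repro")
        .set("spark.hadoop.my.custom.key", "my-value")
      val sc = new SparkContext(conf)

      // SparkContext strips the spark.hadoop. prefix and copies the key into
      // its own Hadoop configuration, so this lookup should succeed:
      assert(sc.hadoopConfiguration.get("my.custom.key") == "my-value")

      // ...but, per the behavior described above, a Parquet DataFrame read
      // does not see the key, because that read path is not built from
      // sc.hadoopConfiguration:
      val sqlContext = new SQLContext(sc)
      val df = sqlContext.read.parquet("hdfs://.../some/table")
      {code}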

      I looked in the codebase and noticed that SqlNewHadoopRDD uses neither the SparkConf nor the SparkContext's Hadoop configuration to set up the Hadoop read; instead, it uses SparkHadoopUtil.get.conf. That Configuration object does not contain Hadoop settings that were set on the SparkContext. In general we seem to have a discrepancy in how Hadoop configurations are propagated: when reading raw RDDs via e.g. SparkContext.textFile(), the Hadoop configuration is taken from the SparkContext, but for DataFrames it comes from SparkHadoopUtil.get.conf, as the sketch below illustrates.
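
      A sketch contrasting the two paths, continuing from the snippet above (SparkHadoopUtil is an internal/developer API; this is illustrative only):

      {code:scala}
      import org.apache.spark.deploy.SparkHadoopUtil

      // Raw RDD path: HadoopRDD is constructed from sc.hadoopConfiguration,
      // which was derived from the user's SparkConf with spark.hadoop.*
      // keys applied, so the custom setting takes effect here.
      val lines = sc.textFile("hdfs://.../some/file.txt")

      // DataFrame path in 1.6: SqlNewHadoopRDD falls back to
      // SparkHadoopUtil.get.conf, a Configuration built from a fresh
      // SparkConf that only sees JVM system properties, so a key set
      // programmatically on the user's SparkConf is missing here:
      assert(SparkHadoopUtil.get.conf.get("my.custom.key") == null)
      {code}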

      We should probably use the SparkContext's Hadoop configuration for DataFrames as well.
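
      In the meantime, a possible workaround consistent with the analysis above: since SparkHadoopUtil builds its Configuration from a SparkConf that reads JVM system properties (which is also how spark-submit --conf values reach the driver), setting the key as a system property should make it visible on that path. A hedged sketch:

      {code:scala}
      // Hedged workaround sketch: set the spark.hadoop.* key as a JVM system
      // property *before* Spark initializes SparkHadoopUtil, so the fresh
      // SparkConf it constructs picks the key up from system properties.
      // Executors are separate JVMs and may still need the property delivered
      // to them, e.g. via spark.executor.extraJavaOptions.
      System.setProperty("spark.hadoop.my.custom.key", "my-value")

      val conf = new SparkConf().setAppName("workaround-sketch")
      val sc = new SparkContext(conf)
      val df = new SQLContext(sc).read.parquet("hdfs://.../some/table")
      {code}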

            People

              Assignee: rxin (Reynold Xin)
              Reporter: mcheah (Matt Cheah)
              Votes: 0
              Watchers: 2
