Spark / SPARK-34346

io.file.buffer.size set by spark.buffer.size is overridden by hive-site.xml, which may cause a perf regression

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.0.2, 3.1.1
    • Fix Version/s: 3.0.2, 3.1.1
    • Component/s: Spark Core, SQL
    • Labels:
      None

      Description

      In many real-world cases, when interacting with the Hive catalog through Spark SQL, users simply share the `hive-site.xml` from their Hive jobs and copy it to `SPARK_HOME`/conf without modification. When Spark generates Hadoop configurations, it uses `spark.buffer.size(65536)` to reset `io.file.buffer.size(4096)`. But when `hive-site.xml` is loaded afterwards, that reset is not preserved, and `io.file.buffer.size` is set again to whatever value `hive-site.xml` contains.

      1. The priority applied when setting Hadoop and Hive configurations here is wrong. From highest to lowest, the order should be `spark > spark.hive > spark.hadoop > hive > hadoop`.

      2. This breaks the `spark.buffer.size` config's role in tuning IO performance with HDFS whenever `hive-site.xml` already contains an `io.file.buffer.size` entry.
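
      The two points above can be illustrated with a small model. This is only a sketch: plain dictionaries stand in for the configuration sources, the helper names `merge_layers` and `split_spark_conf` are hypothetical and not part of Spark, and the `8192` value in `hive_site` is an assumed example of a value copied over from a Hive deployment. Layers are merged from lowest to highest priority, so a later layer wins.

```python
# Illustrative model only: dicts stand in for the real configuration sources.
# merge_layers and split_spark_conf are hypothetical helpers, not Spark APIs.

def merge_layers(layers):
    """Merge config layers given lowest priority first; later layers win."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


def split_spark_conf(spark_conf):
    """Split a Spark conf into the `spark`, `spark.hive` and `spark.hadoop` layers."""
    spark_hadoop, spark_hive = {}, {}
    for key, value in spark_conf.items():
        if key.startswith("spark.hadoop."):
            # spark.hadoop.foo -> foo
            spark_hadoop[key[len("spark.hadoop."):]] = value
        elif key.startswith("spark.hive."):
            # spark.hive.foo -> hive.foo
            spark_hive[key[len("spark."):]] = value
    # Spark's own setting that is mapped onto a Hadoop key.
    spark_own = {"io.file.buffer.size": spark_conf.get("spark.buffer.size", "65536")}
    return spark_own, spark_hive, spark_hadoop


hadoop_defaults = {"io.file.buffer.size": "4096"}
hive_site = {"io.file.buffer.size": "8192"}  # assumed value, for illustration
spark_conf = {"spark.buffer.size": "65536"}

spark_own, spark_hive, spark_hadoop = split_spark_conf(spark_conf)

# Reported behavior: hive-site.xml is applied last and wins.
reported = merge_layers([hadoop_defaults, spark_hadoop, spark_hive, spark_own, hive_site])

# Intended order: spark > spark.hive > spark.hadoop > hive > hadoop.
intended = merge_layers([hadoop_defaults, hive_site, spark_hadoop, spark_hive, spark_own])

print(reported["io.file.buffer.size"])  # 8192
print(intended["io.file.buffer.size"])  # 65536
```

      In this model the only difference between the two results is where the `hive_site` layer sits in the merge order, which is the priority problem described in point 1.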

            People

            • Assignee:
              Qin Yao Kent Yao
              Reporter:
              Qin Yao Kent Yao