SPARK-27846

Eagerly compute Configuration.properties in sc.hadoopConfiguration


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: Spark Core
    • Labels: None

      Description

      Hadoop Configuration has an internal properties map which is lazily initialized. Initialization of this field, done in the private Configuration.getProps() method, is rather expensive because it ends up parsing XML configuration files. When cloning a Configuration, this properties field is cloned if it has been initialized.
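
      The clone-then-lazily-parse behavior can be reproduced with plain Hadoop APIs. The sketch below is illustrative only (LazyCloneDemo and timeMs are made-up names, not Spark or Hadoop code): it clones a Configuration whose properties have not yet been computed and then times the first lookup on the clone, which is what forces the XML parsing.

      {code:scala}
      import org.apache.hadoop.conf.Configuration

      object LazyCloneDemo {
        // Illustrative timing helper, not part of Spark or Hadoop.
        def timeMs[A](label: String)(body: => A): A = {
          val start = System.nanoTime()
          val result = body
          println(f"$label took ${(System.nanoTime() - start) / 1e6}%.2f ms")
          result
        }

        def main(args: Array[String]): Unit = {
          // A fresh Configuration has not parsed its XML resources yet, so its
          // internal properties field is still null.
          val base = new Configuration()

          // The copy constructor only copies the properties field when it is
          // non-null, so this clone also starts out with nothing cached.
          val clone = new Configuration(base)

          // The first lookup triggers the private getProps(), which parses
          // core-default.xml, core-site.xml, etc. from scratch.
          timeMs("first lookup on clone")(clone.get("fs.defaultFS"))
        }
      }
      {code}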

      In some cases sc.hadoopConfiguration never ends up computing this properties field. That becomes a performance problem when the configuration is cloned in SessionState.newHadoopConf(), because each clone must then re-parse the configuration XML files from disk.

      To avoid this problem, we can call configuration.size() to trigger a call to getProps(), ensuring that this expensive computation is cached and re-used when cloning configurations.
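
      A minimal sketch of that workaround, assuming a plain Hadoop Configuration (EagerHadoopConf and eagerlyLoadProps are illustrative names, not the actual Spark patch):

      {code:scala}
      import org.apache.hadoop.conf.Configuration

      object EagerHadoopConf {
        // size() goes through the private getProps(), so the XML resources are
        // parsed once and the result stays cached inside this Configuration.
        def eagerlyLoadProps(hadoopConf: Configuration): Configuration = {
          hadoopConf.size()
          hadoopConf
        }
      }
      {code}

      Clones made afterwards (for example via new Configuration(base), as in SessionState.newHadoopConf()) copy the in-memory properties map instead of re-parsing the XML files from disk.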

      I discovered this problem while performance-profiling the Spark ThriftServer under a SQL fuzzing workload.


    People

    • Assignee: Josh Rosen (joshrosen)
    • Reporter: Josh Rosen (joshrosen)
    • Votes: 0
    • Watchers: 2
