[SPARK-27846] Eagerly compute Configuration.properties in sc.hadoopConfiguration


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: Spark Core
    • Labels: None

    Description

      Hadoop Configuration has an internal properties map which is lazily initialized. Initialization of this field, done in the private Configuration.getProps() method, is rather expensive because it ends up parsing XML configuration files. When cloning a Configuration, this properties field is cloned if it has been initialized.

      In some cases it's possible that sc.hadoopConfiguration never ends up computing this properties field, leading to performance problems when this configuration is cloned in SessionState.newHadoopConf() because each clone needs to re-parse configuration XML files from disk.
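
      For illustration only, here is a REPL-style sketch (not code from this ticket's patch) of why cloning an uninitialized Configuration is expensive; it uses Hadoop's Configuration API directly, with the value base standing in for sc.hadoopConfiguration:

          import org.apache.hadoop.conf.Configuration

          // The properties map of base has never been materialized because
          // nothing has read a key from it yet.
          val base = new Configuration()

          // Each clone therefore starts without a properties map and re-parses
          // the XML resources (core-site.xml, etc.) itself on first access.
          val clones = (1 to 100).map(_ => new Configuration(base))
          clones.foreach(_.get("fs.defaultFS"))   // ~100 separate XML parses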

      To avoid this problem, we can call configuration.size() to trigger a call to getProps(), ensuring that this expensive computation is cached and re-used when cloning configurations.
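
      A matching sketch of the workaround (again plain Hadoop API, not the patch itself); the single eager call is the whole trick:

          import org.apache.hadoop.conf.Configuration

          val base = new Configuration()

          // size() goes through the private getProps(), so the XML resources are
          // parsed exactly once here and the result stays cached inside base.
          base.size()

          // Clones made afterwards copy the already-materialized properties map
          // instead of re-parsing the configuration files from disk.
          val clone = new Configuration(base)
          println(clone.get("fs.defaultFS"))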

      I discovered this problem while performance-profiling the Spark ThriftServer under a SQL fuzzing workload.


People

    Assignee: joshrosen (Josh Rosen)
    Reporter: joshrosen (Josh Rosen)
    Votes: 0
    Watchers: 2
