SPARK-27846

Eagerly compute Configuration.properties in sc.hadoopConfiguration


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: Spark Core
    • Labels: None

      Description

      Hadoop Configuration has an internal properties map which is lazily initialized. Initialization of this field, done in the private Configuration.getProps() method, is rather expensive because it ends up parsing XML configuration files. When cloning a Configuration, this properties field is cloned if it has been initialized.
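
      The clone-then-lazily-parse behavior can be reproduced with plain Hadoop APIs. The sketch below is illustrative only (LazyCloneDemo and timeMs are made-up names, not Spark or Hadoop code): it clones a Configuration whose properties have not yet been computed and then times the first lookup on the clone, which is what forces the XML parsing.

      {code:scala}
      import org.apache.hadoop.conf.Configuration

      object LazyCloneDemo {
        // Illustrative timing helper, not part of Spark or Hadoop.
        def timeMs[A](label: String)(body: => A): A = {
          val start = System.nanoTime()
          val result = body
          println(f"$label took ${(System.nanoTime() - start) / 1e6}%.2f ms")
          result
        }

        def main(args: Array[String]): Unit = {
          // A fresh Configuration has not parsed its XML resources yet, so its
          // internal properties field is still null.
          val base = new Configuration()

          // The copy constructor only copies the properties field when it is
          // non-null, so this clone also starts out with nothing cached.
          val clone = new Configuration(base)

          // The first lookup triggers the private getProps(), which parses
          // core-default.xml, core-site.xml, etc. from scratch.
          timeMs("first lookup on clone")(clone.get("fs.defaultFS"))
        }
      }
      {code}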

      In some cases sc.hadoopConfiguration never ends up computing this properties field. That becomes a performance problem when the configuration is cloned in SessionState.newHadoopConf(), because each clone must then re-parse the configuration XML files from disk.

      To avoid this problem, we can call configuration.size() to trigger a call to getProps(), ensuring that this expensive computation is cached and re-used when cloning configurations.
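
      A minimal sketch of that workaround, assuming a plain Hadoop Configuration (EagerHadoopConf and eagerlyLoadProps are illustrative names, not the actual Spark patch):

      {code:scala}
      import org.apache.hadoop.conf.Configuration

      object EagerHadoopConf {
        // size() goes through the private getProps(), so the XML resources are
        // parsed once and the result stays cached inside this Configuration.
        def eagerlyLoadProps(hadoopConf: Configuration): Configuration = {
          hadoopConf.size()
          hadoopConf
        }
      }
      {code}

      Clones made afterwards (for example via new Configuration(base), as in SessionState.newHadoopConf()) copy the in-memory properties map instead of re-parsing the XML files from disk.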

      I discovered this problem while performance-profiling the Spark ThriftServer under a SQL fuzzing workload.


    People

    • Assignee: Josh Rosen (joshrosen)
    • Reporter: Josh Rosen (joshrosen)
    • Votes: 0
    • Watchers: 2
