Hadoop's Configuration class has an internal properties map that is lazily initialized. Initializing this field, which happens in the private Configuration.getProps() method, is expensive because it ends up parsing XML configuration files. When a Configuration is cloned, the properties field is copied only if it has already been initialized.
In some cases, sc.hadoopConfiguration never computes this properties field, which causes performance problems when that configuration is cloned in SessionState.newHadoopConf(): each clone must re-parse the configuration XML files from disk.
To avoid this problem, we can call configuration.size() to trigger a call to getProps(), ensuring that this expensive computation is performed once and its result is reused when the configuration is cloned.
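To illustrate the mechanism, here is a minimal sketch (not actual Hadoop code; the class `LazyConfig` and its static `parseCount` counter are invented for the example) of a lazily initialized properties map whose copy constructor, like Configuration's, only clones the map when it has already been materialized. Forcing initialization once via `size()` before cloning means the clones copy the cached map instead of each re-running the expensive parse:

```java
import java.util.Properties;

// Hypothetical model of Configuration's lazy properties map (assumption, for illustration).
class LazyConfig {
    static int parseCount = 0;   // counts how many times the "XML parse" ran
    private Properties props;    // lazily initialized, like Configuration.properties

    LazyConfig() {}

    // Copy constructor: clones props only if already initialized,
    // mirroring Configuration(Configuration other).
    LazyConfig(LazyConfig other) {
        if (other.props != null) {
            this.props = (Properties) other.props.clone();
        }
    }

    // Mirrors the expensive private Configuration.getProps().
    private synchronized Properties getProps() {
        if (props == null) {
            parseCount++;        // stands in for parsing XML files from disk
            props = new Properties();
            props.setProperty("fs.defaultFS", "file:///");
        }
        return props;
    }

    // Calling size() forces initialization, which is the essence of the fix.
    public int size() { return getProps().size(); }

    public static void main(String[] args) {
        // Without the fix: base is never initialized, so every clone parses on its own.
        LazyConfig base = new LazyConfig();
        new LazyConfig(base).size();
        new LazyConfig(base).size();
        System.out.println(parseCount);   // prints 2: each clone parsed separately

        // With the fix: initialize once up front, clones reuse the cached map.
        parseCount = 0;
        LazyConfig warmed = new LazyConfig();
        warmed.size();                    // triggers the one-time parse
        new LazyConfig(warmed).size();
        new LazyConfig(warmed).size();
        System.out.println(parseCount);   // prints 1: parse happened only once
    }
}
```

The same reasoning applies to the real Configuration: a single `sc.hadoopConfiguration.size()` call before the configuration is handed to SessionState.newHadoopConf() moves the XML parsing out of the per-clone path.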
I discovered this problem while profiling the Spark ThriftServer under a SQL fuzzing workload.