In certain Hadoop deployments, the HADOOP_CONF_DIR environment variable can contain multiple paths. For example, when running spark-submit, Cloudera 6.3 sets it to a colon-separated list of configuration directories.
Currently, the HadoopFileSystemOptions class reads the content of the variable but treats it as a single path. When the variable contains multiple paths, Beam cannot properly configure Hadoop, so HDFS cannot be accessed. At the moment, the only workarounds I'm aware of are:
- Override the HADOOP_CONF_DIR that Cloudera sets for the Spark service. However, I think this could cause problems with other tools (for example when using Hive from Spark, because Spark might no longer find the Hive configuration).
- Pass the HDFS configuration using the --hdfsConfiguration option. This is inconvenient when there are many properties to set, and they are not updated automatically when the cluster is reconfigured in Cloudera Manager.
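For reference, the second workaround looks roughly like the following. The class name, jar name, and namenode address are placeholders, and the exact JSON value depends on the cluster; the option takes a JSON list of maps of Hadoop property names to values:

```shell
# Pass HDFS settings inline instead of relying on HADOOP_CONF_DIR.
# "org.example.MyBeamPipeline", "my-pipeline.jar", and the namenode
# host/port are illustrative, not a real deployment.
spark-submit \
  --class org.example.MyBeamPipeline \
  my-pipeline.jar \
  --hdfsConfiguration='[{"fs.defaultFS": "hdfs://namenode:8020"}]'
```

Every property that would otherwise come from core-site.xml/hdfs-site.xml has to be repeated here by hand, which is what makes this workaround impractical for larger configurations.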
In my opinion, the fix is for the HadoopFileSystemOptions class to split the content of the HADOOP_CONF_DIR environment variable on the colon (":") separator and treat each resulting entry as a configuration directory.
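The core of the proposed change can be sketched as below. This is a minimal illustration, not the actual Beam patch: `ConfDirSplitter` and `splitConfDirs` are hypothetical names, and the real fix would feed each resulting directory into Hadoop's configuration loading.

```java
import java.util.Arrays;
import java.util.List;

public class ConfDirSplitter {

    /**
     * Hypothetical helper showing the proposed behavior: split the
     * HADOOP_CONF_DIR value on ":" so that each directory can be
     * scanned for *-site.xml files, instead of treating the whole
     * value as one path. A value with no colon still yields a
     * single-element list, so the existing single-path case keeps
     * working unchanged.
     */
    static List<String> splitConfDirs(String hadoopConfDir) {
        return Arrays.asList(hadoopConfDir.split(":"));
    }

    public static void main(String[] args) {
        // Illustrative multi-path value; the paths are placeholders,
        // not the exact layout Cloudera uses.
        String value = "/etc/hadoop/conf:/etc/spark/conf/yarn-conf";
        System.out.println(splitConfDirs(value));
    }
}
```

With this in place, Beam could load core-site.xml and hdfs-site.xml from whichever of the listed directories contains them, matching how Hadoop tools interpret a multi-path HADOOP_CONF_DIR.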
I have already implemented this fix, and all tests for the HadoopFileSystemOptions class pass. I'm preparing a pull request.