Beam / BEAM-9315

HadoopFileSystemOptions unable to interpret HADOOP_CONF_DIR with multiple paths

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: 2.19.0
    • Fix Version/s: 2.20.0
    • None
    • Environment: Cloudera CDH 6.3.2 with Spark 2.4.0 (Scala 2.11)

    Description

      In certain Hadoop deployments the HADOOP_CONF_DIR environment variable can contain multiple paths. For example, when running spark-submit, Cloudera 6.3 sets it as follows:

      HADOOP_CONF_DIR=/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/conf/yarn-conf:/etc/hive/conf

      Currently the HadoopFileSystemOptions class reads the content of the variable but treats it as a single path. When the variable contains multiple paths, Beam cannot configure Hadoop properly, so HDFS can't be accessed. The only workarounds I'm aware of are:

      • Override the HADOOP_CONF_DIR set by Cloudera for the Spark service. I think this could cause problems with other tools (for example when using Hive from Spark, because Spark might not be able to find the Hive configuration).
      • Pass the HDFS configuration using the --hdfsConfiguration option. This is inconvenient when there are many settings to pass, and they are not updated automatically when the cluster is reconfigured in Cloudera Manager.

      In my opinion, to fix this the HadoopFileSystemOptions class should split the content of the HADOOP_CONF_DIR environment variable on the colon (":") so that every path it contains is detected.
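The proposed splitting could look roughly like the sketch below. It is only an illustration, not the code of the actual patch: the class HadoopConfDirs and its methods are hypothetical names, and the assumption that each directory is searched for core-site.xml and hdfs-site.xml is mine. It uses plain JDK classes so it can be run without Hadoop on the classpath.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Beam: illustrates treating HADOOP_CONF_DIR
// as a colon-separated list of directories instead of a single path.
public class HadoopConfDirs {

  // Assumption: these are the configuration files worth looking for.
  static final String[] CONF_FILES = {"core-site.xml", "hdfs-site.xml"};

  // Splits the variable's value on ":" and drops empty entries.
  static List<String> splitConfDirs(String value) {
    List<String> dirs = new ArrayList<>();
    if (value == null) {
      return dirs;
    }
    for (String dir : value.split(":")) {
      String trimmed = dir.trim();
      if (!trimmed.isEmpty()) {
        dirs.add(trimmed);
      }
    }
    return dirs;
  }

  // Returns every known configuration file that exists in any listed directory,
  // in the order the directories appear in the variable.
  static List<Path> findConfFiles(String value) {
    List<Path> found = new ArrayList<>();
    for (String dir : splitConfDirs(value)) {
      for (String name : CONF_FILES) {
        Path candidate = Paths.get(dir, name);
        if (Files.isRegularFile(candidate)) {
          found.add(candidate);
        }
      }
    }
    return found;
  }

  public static void main(String[] args) {
    String value = System.getenv("HADOOP_CONF_DIR");
    System.out.println("Directories: " + splitConfDirs(value));
    System.out.println("Config files found: " + findConfFiles(value));
  }
}
```

With the Cloudera value shown above, splitConfDirs would return the yarn-conf directory and /etc/hive/conf as two separate entries. Each file found could then be added to a Hadoop Configuration as a resource.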

      I have already fixed this and all tests on class HadoopFileSystemOptions pass successfully. I'm preparing a pull request.

       

            People

              Assignee: claventu Claudio Venturini
              Reporter: claventu Claudio Venturini

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 40m