Apache Arrow / ARROW-13141

[C++][Python] HadoopFileSystem: automatically set CLASSPATH based on HADOOP_HOME env variable?


Details

    Description

      In the "legacy" python-specific HadoopFileSystem implementation, we have a _maybe_set_hadoop_classpath function which has some logic to set the CLASSPATH environment variable based on HADOOP_HOME or the hadoop executable: https://github.com/apache/arrow/blob/c43fab3d621bedef15470a1be43570be2026af20/python/pyarrow/hdfs.py#L134-L149

      This is also mentioned in the documentation of the new HadoopFileSystem (https://arrow.apache.org/docs/python/filesystems.html#hadoop-file-system-hdfs ):

      > If CLASSPATH is not set, then it will be set automatically if the hadoop executable is in your system path, or if HADOOP_HOME is set.

      However, this sentence was probably copied over from the docs for the legacy filesystem: the new HadoopFileSystem implementation has no logic to set CLASSPATH automatically.

      Do we want to add this logic to the new implementation as well (in Cython, or rather in C++)? If not, we should update the docs to clarify that CLASSPATH is actually required.

      cc apitrou
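
      To make the question concrete, here is a minimal sketch of what such automatic setup could look like on the Python side. It is not the legacy `_maybe_set_hadoop_classpath` code and not an existing pyarrow API: the helper names `find_hadoop_executable` and `ensure_hadoop_classpath` are hypothetical, and the sketch assumes that `hadoop classpath --glob` prints a usable classpath and that the launcher lives at `$HADOOP_HOME/bin/hadoop`.

```python
import os
import shutil
import subprocess


def find_hadoop_executable(env):
    """Return a path to the `hadoop` launcher, or None if none is found.

    Assumption: when HADOOP_HOME is set, the launcher is at
    $HADOOP_HOME/bin/hadoop; otherwise we search the PATH.
    """
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home:
        return os.path.join(hadoop_home, "bin", "hadoop")
    return shutil.which("hadoop", path=env.get("PATH"))


def ensure_hadoop_classpath(env=None, run=subprocess.check_output):
    """Set CLASSPATH in `env` if it is missing, and return its value.

    Hypothetical helper: `run` is injectable so the logic can be
    exercised without a Hadoop installation.
    """
    env = os.environ if env is None else env

    # Respect a CLASSPATH that the user has already configured.
    if env.get("CLASSPATH"):
        return env["CLASSPATH"]

    hadoop = find_hadoop_executable(env)
    if hadoop is None:
        raise RuntimeError(
            "CLASSPATH is not set and no hadoop executable was found "
            "(set HADOOP_HOME or add hadoop to PATH)"
        )

    # `hadoop classpath --glob` expands wildcard entries into jar paths.
    output = run([hadoop, "classpath", "--glob"])
    env["CLASSPATH"] = output.decode("utf-8").strip()
    return env["CLASSPATH"]
```

      Under these assumptions, a user would call `ensure_hadoop_classpath()` before constructing `pyarrow.fs.HadoopFileSystem`, so that the variable is in place before the JVM is started. Doing the same in C++ would cover non-Python users as well, which is part of the question above.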


            People

              jorisvandenbossche Joris Van den Bossche
              jorisvandenbossche Joris Van den Bossche


                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h 10m