Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
In the "legacy" Python-specific HadoopFileSystem implementation, we have a _maybe_set_hadoop_classpath function with logic to set the CLASSPATH environment variable based on HADOOP_HOME or the hadoop executable: https://github.com/apache/arrow/blob/c43fab3d621bedef15470a1be43570be2026af20/python/pyarrow/hdfs.py#L134-L149
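Roughly, that legacy helper does something like the following (a simplified sketch of the behaviour, not the exact pyarrow source; error handling omitted):

```python
import os
import shutil
import subprocess

def _maybe_set_hadoop_classpath():
    # Do nothing if CLASSPATH already appears to contain the hadoop jars.
    if 'hadoop' in os.environ.get('CLASSPATH', ''):
        return

    # Locate the hadoop executable: prefer $HADOOP_HOME/bin/hadoop,
    # otherwise fall back to whatever `hadoop` is on the system PATH.
    if 'HADOOP_HOME' in os.environ:
        hadoop_bin = os.path.join(os.environ['HADOOP_HOME'], 'bin', 'hadoop')
    else:
        hadoop_bin = shutil.which('hadoop')

    # Ask hadoop itself for the full jar classpath and export it so that
    # libhdfs (via JNI) can find the required classes.
    classpath = subprocess.check_output([hadoop_bin, 'classpath', '--glob'])
    os.environ['CLASSPATH'] = classpath.decode('utf-8').strip()
```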
This is also mentioned in the documentation of the new HadoopFileSystem (https://arrow.apache.org/docs/python/filesystems.html#hadoop-file-system-hdfs ):
> If CLASSPATH is not set, then it will be set automatically if the hadoop executable is in your system path, or if HADOOP_HOME is set.
However, this sentence was probably copied over from the docs about the legacy filesystem, and the new HadoopFileSystem implementation does not have this logic to automatically set CLASSPATH.
Do we want to add this logic to the new implementation as well (in Cython, or actually in C++)? If not, we should update the docs to clarify that CLASSPATH is actually required.
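For context, with the new filesystem the user currently has to set CLASSPATH themselves before constructing it, e.g. something like the snippet below (host and port are placeholders, and this assumes the hadoop executable is on the PATH):

```python
import os
import subprocess

import pyarrow.fs

# Manually export the jar classpath, since the new HadoopFileSystem
# does not set it up automatically.
os.environ['CLASSPATH'] = subprocess.check_output(
    ['hadoop', 'classpath', '--glob']).decode('utf-8').strip()

hdfs = pyarrow.fs.HadoopFileSystem('namenode-host', port=8020)
```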
cc apitrou