Apache Arrow / ARROW-9226

[Python] pyarrow.fs.HadoopFileSystem - retrieve options from core-site.xml or hdfs-site.xml if available


Details

    Description

  The 'legacy' pyarrow.hdfs.connect was able to get the namenode info from the Hadoop configuration files.

      The new pyarrow.fs.HadoopFileSystem requires the host to be specified.

      Inferring this info from "the environment" makes it easier to deploy pipelines.

  But more importantly, for HA namenodes it is almost impossible to know for sure what to specify. During a rolling restart, the active namenode changes; in an HA setup there is no guarantee which node will be active.

  I tried connecting to the standby namenode. The connection gets established, but when writing a file an error is raised stating that writing to a standby namenode is not allowed.
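As a workaround sketch: pyarrow's HadoopFileSystem documents a special host value `"default"` that defers namenode resolution to `fs.defaultFS` in the Hadoop configuration (core-site.xml), which also sidesteps the HA problem described above. A minimal example, assuming pyarrow with HDFS support and a Hadoop client configuration on the machine (the helper name `connect_default_hdfs` is hypothetical):

```python
def connect_default_hdfs():
    """Connect to HDFS using the namenode configured in core-site.xml.

    Passing host="default" tells libhdfs to resolve fs.defaultFS from the
    Hadoop client configuration (CLASSPATH / HADOOP_CONF_DIR), rather than
    requiring an explicit namenode host. In an HA setup this lets the
    Hadoop client pick the currently active namenode.
    """
    # Imported lazily so this sketch can be read without pyarrow installed.
    from pyarrow import fs

    return fs.HadoopFileSystem(host="default")
```

Calling `connect_default_hdfs()` on a correctly configured client should return a usable filesystem without hard-coding a namenode host.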


      Attachments

        Issue Links

        Activity


          People

            Assignee:
            itamarst Itamar Turner-Trauring
            Reporter:
            bquinart Bruno Quinart
            Votes:
            0
            Watchers:
            7

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Not Specified
                Remaining:
                0h
                Logged:
                1h 40m
