Apache Arrow / ARROW-5922

[Python] Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow's hdfs API


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Works for Me
    • Affects Version: 0.14.0
    • Fix Version: 0.14.0
    • Component: Python
    • Labels: None
    • Environment: Unix

    Description

      Here's what I'm trying:

      ```

      import pyarrow as pa

      conf = {"hadoop.security.authentication": "kerberos"}
      fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_44444", extra_conf=conf)

      ```

      However, when I submit this job to the cluster using Dask-YARN, I get the following error:

      ```

      File "test/run.py", line 3
        fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_44444", extra_conf=conf)
      File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_000003/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 211, in connect
      File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_000003/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 38, in __init__
      File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
      File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
      pyarrow.lib.ArrowIOError: HDFS connection failed

      ```

      I also tried setting the host (to a name node) and the port (8020), but I ran into the same error. Since the error message is not descriptive, I'm not sure which setting needs to be altered. Does anyone have any clues?
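Since the failure surfaces inside libhdfs (the JNI driver pyarrow's `hdfs.connect` uses), one common cause on YARN worker containers is that the environment variables libhdfs needs (`JAVA_HOME`, `HADOOP_HOME`, and a `CLASSPATH` typically built from `hadoop classpath --glob`) are missing inside the container even though they are set on the edge node. The helper below is a minimal diagnostic sketch, not part of pyarrow; `missing_hdfs_env` is a hypothetical name, and the variable list comes from pyarrow's HDFS documentation:

```python
import os

# Environment variables that libhdfs expects at connect time, per
# pyarrow's HDFS docs. CLASSPATH is usually the output of
# `hadoop classpath --glob` on the cluster.
REQUIRED_ENV = ("JAVA_HOME", "HADOOP_HOME", "CLASSPATH")


def missing_hdfs_env(environ=None):
    """Return the required variables that are unset or empty, so a
    worker can log them before calling pa.hdfs.connect()."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV if not environ.get(name)]
```

Logging `missing_hdfs_env()` at the top of the Dask-YARN task would show whether the container environment differs from the machine where the same connect call works.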


People

    Assignee: Unassigned
    Reporter: Saurabh Bajaj (sbajaj)
