Apache Arrow / ARROW-10872

[Python] pyarrow.fs.HadoopFileSystem cannot access Azure Data Lake (ADLS)


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: Python

    Description

      It is not possible to open an `abfs://` or `abfss://` URI with pyarrow.fs.HadoopFileSystem.

      Using `HadoopFileSystem.from_uri(path)` does not work: libhdfs throws an error saying that the authority is invalid (I checked, and this is because the authority string is empty).

      Note that the legacy pyarrow.hdfs.HadoopFileSystem interface works, for example:

      • pyarrow.hdfs.HadoopFileSystem(host="abfs://xxx@xxx.dfs.core.windows.net")
      • pyarrow.hdfs.connect(host="abfs://xxx@xxx.dfs.core.windows.net")

      and I believe the new interface should also work by passing the full URI as "host" to the `pyarrow.fs.HadoopFileSystem` constructor. However, the constructor wrongly prepends "hdfs://" to the host: https://github.com/apache/arrow/blob/25c736d48dc289f457e74d15d05db65f6d539447/python/pyarrow/_hdfs.pyx#L64
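      A minimal sketch of why prepending "hdfs://" to an already-complete URI breaks the authority, using only Python's standard `urllib.parse` as an approximation (pyarrow uses its own C++ URI parser, which per this report yields an *empty* authority; the `xxx@xxx.dfs.core.windows.net` values are the placeholders from the examples above):

      ```python
      from urllib.parse import urlparse

      # A full ADLS URI, as one would pass to the legacy interface as "host".
      adls_uri = "abfs://xxx@xxx.dfs.core.windows.net"

      # Parsed on its own, the authority (netloc) is intact.
      print(urlparse(adls_uri).netloc)  # xxx@xxx.dfs.core.windows.net

      # The new constructor effectively builds "hdfs://" + host, so the
      # string that gets parsed downstream becomes:
      mangled = "hdfs://" + adls_uri
      parsed = urlparse(mangled)
      print(parsed.scheme)  # hdfs
      print(parsed.netloc)  # abfs:  -- no longer a valid authority
      print(parsed.path)    # //xxx@xxx.dfs.core.windows.net
      ```

      The original "abfs" scheme and the real host end up split between an invalid netloc and the path, so nothing usable reaches libhdfs.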


          People

            Assignee: Unassigned
            Reporter: Juan Galvez (jjgalvez)
            Votes: 0
            Watchers: 4
