Apache Arrow / ARROW-7957

[Python] ParquetDataset cannot take HadoopFileSystem as filesystem

Details

    Description

      from pyarrow.fs import HadoopFileSystem
      import pyarrow.parquet as pq

       

      file_name = "hdfs://localhost:9000/test/file_name.pq"
      hdfs, path = HadoopFileSystem.from_uri(file_name)
      dataset = pq.ParquetDataset(file_name, filesystem=hdfs)

       

      This raises the following error:
      OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>
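
For reference, the URI in the snippet above carries three pieces of information: host, port, and file path. The following is a minimal standard-library sketch of that split. The helper name `split_hdfs_uri` is hypothetical and not part of pyarrow, and it is only an assumption that the path returned by `HadoopFileSystem.from_uri` corresponds to the path component shown here.

```python
from urllib.parse import urlsplit


def split_hdfs_uri(uri):
    """Hypothetical helper (not a pyarrow API): split an hdfs:// URI
    into its host, port, and path components."""
    parts = urlsplit(uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"not an hdfs URI: {uri!r}")
    return parts.hostname, parts.port, parts.path


host, port, path = split_hdfs_uri("hdfs://localhost:9000/test/file_name.pq")
print(host, port, path)
```

The host and port are the same values passed to the deprecated `pyarrow.hdfs.connect` call in this report, and the path is what remains once the `hdfs://localhost:9000` prefix is removed.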

       

      When I tried using the deprecated HadoopFileSystem:

      import pyarrow
      import pyarrow.parquet as pq

      file_name = "hdfs://localhost:9000/test/file_name.pq"
      # Connect through the deprecated (legacy) HDFS interface
      hdfs = pyarrow.hdfs.connect('localhost', 9000)
      dataset = pq.ParquetDataset(file_name, filesystem=hdfs)
      pa_schema = dataset.schema.to_arrow_schema()

      # Print the path of each piece in the dataset
      pieces = dataset.pieces
      for piece in pieces:
          print(piece.path)

       

      piece.path loses the hdfs://localhost:9000 prefix.

       

      Shouldn't ParquetDataset accept pyarrow.fs.HadoopFileSystem as its filesystem argument?

      And shouldn't piece.path keep the prefix?
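
Until the prefix question is settled, the full URIs can be rebuilt on the caller's side. The following is a minimal sketch of such a workaround using only the standard library. The helper name `restore_prefix` is hypothetical (not a pyarrow API), and the list of bare paths is a stand-in for `[piece.path for piece in dataset.pieces]`.

```python
from urllib.parse import urlsplit


def restore_prefix(original_uri, stripped_paths):
    """Hypothetical workaround (not a pyarrow API): re-attach the
    scheme://host:port prefix of original_uri to paths that lack it."""
    parts = urlsplit(original_uri)
    prefix = f"{parts.scheme}://{parts.netloc}"
    # Leave paths that already carry a scheme untouched
    return [p if "://" in p else prefix + p for p in stripped_paths]


full_paths = restore_prefix(
    "hdfs://localhost:9000/test/file_name.pq",
    ["/test/file_name.pq"],  # stand-in for the piece paths
)
print(full_paths)
```

This only patches the strings after the fact; it does not change what the library itself stores in `piece.path`.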

            People

              jorisvandenbossche Joris Van den Bossche
              cat-yu Catherine
              Votes: 0
              Watchers: 6

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 40m