Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
0.16.0
Description
from pyarrow.fs import HadoopFileSystem
import pyarrow.parquet as pq
file_name = "hdfs://localhost:9000/test/file_name.pq"
hdfs, path = HadoopFileSystem.from_uri(file_name)
dataset = pq.ParquetDataset(file_name, filesystem=hdfs)
This fails with the error:
OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>
When I tried using the deprecated HadoopFileSystem:
import pyarrow
import pyarrow.parquet as pq
file_name = "hdfs://localhost:9000/test/file_name.pq"
hdfs = pyarrow.hdfs.connect('localhost', 9000)
dataset = pq.ParquetDataset(file_name, filesystem=hdfs)
pa_schema = dataset.schema.to_arrow_schema()
pieces = dataset.pieces
for piece in pieces:
    print(piece.path)
this works, but piece.path loses the hdfs://localhost:9000 prefix.
I think ParquetDataset should accept a pyarrow.fs.HadoopFileSystem as the filesystem argument.
Should piece.path also keep the hdfs://localhost:9000 prefix?
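Until the prefix question is settled, a possible stopgap is to re-attach the scheme and host:port by hand. The following is only a sketch under assumptions: `restore_prefix` is a hypothetical helper, not part of pyarrow, and it assumes piece.path comes back as a bare path such as /test/file_name.pq. It uses only the standard library, so it runs without an HDFS cluster.

```python
from urllib.parse import urlsplit


def restore_prefix(original_uri, piece_path):
    """Re-attach the scheme and host:port of original_uri to a bare path.

    Hypothetical helper for illustration; not a pyarrow API.
    """
    parts = urlsplit(original_uri)
    # Nothing to restore if the original was not a scheme://host URI.
    if not parts.scheme or not parts.netloc:
        return piece_path
    # Leave the path alone if it already carries a scheme.
    if urlsplit(piece_path).scheme:
        return piece_path
    return f"{parts.scheme}://{parts.netloc}/{piece_path.lstrip('/')}"


file_name = "hdfs://localhost:9000/test/file_name.pq"
# Assumed shape of piece.path after the prefix has been dropped.
print(restore_prefix(file_name, "/test/file_name.pq"))
```

In the loop above this would be called as restore_prefix(file_name, piece.path) for each piece.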