[ARROW-7957] [Python] ParquetDataset cannot take HadoopFileSystem as filesystem - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 0.16.0
Fix Version/s: 2.0.0
Component/s: Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/24176

Description

from pyarrow.fs import HadoopFileSystem
import pyarrow.parquet as pq

file_name = "hdfs://localhost:9000/test/file_name.pq"
hdfs, path = HadoopFileSystem.from_uri(file_name)
dataset = pq.ParquetDataset(file_name, filesystem=hdfs)

This fails with:
OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>

When I tried using the deprecated HadoopFileSystem:

import pyarrow
import pyarrow.parquet as pq

file_name = "hdfs://localhost:9000/test/file_name.pq"

# Connect through the legacy (deprecated) HDFS API
hdfs = pyarrow.hdfs.connect('localhost', 9000)

dataset = pq.ParquetDataset(file_name, filesystem=hdfs)
pa_schema = dataset.schema.to_arrow_schema()
pieces = dataset.pieces

# Print the path recorded for each piece of the dataset
for piece in pieces:
    print(piece.path)

This runs, but each piece.path loses the hdfs://localhost:9000 prefix.

Should ParquetDataset accept a pyarrow.fs.HadoopFileSystem as the filesystem argument?

And should piece.path keep the hdfs://localhost:9000 prefix?
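Until the prefix question is settled, a caller could restore it by hand. The following is a minimal sketch, not part of pyarrow: with_uri_prefix is a hypothetical helper name, and it assumes piece.path is the original path with only the scheme and host:port stripped.

```python
from urllib.parse import urlsplit


def with_uri_prefix(uri, piece_path):
    """Hypothetical helper: re-attach the scheme://host:port of `uri` to `piece_path`."""
    parts = urlsplit(uri)
    prefix = f"{parts.scheme}://{parts.netloc}"
    # Leave paths that already carry the prefix untouched.
    if piece_path.startswith(prefix):
        return piece_path
    # Join with exactly one slash between the authority and the path.
    return prefix + "/" + piece_path.lstrip("/")


file_name = "hdfs://localhost:9000/test/file_name.pq"
full_path = with_uri_prefix(file_name, "/test/file_name.pq")
```

In the loop above this would be called as with_uri_prefix(file_name, piece.path); whether that matches what ParquetDataset should return natively is exactly the open question of this issue.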

Issue Links

links to

GitHub Pull Request #8414

People

Assignee: Joris Van den Bossche
Reporter: Catherine
Votes: 0
Watchers: 6

Dates

Created: 27/Feb/20 18:57
Updated: 11/Jan/23 07:57
Resolved: 10/Oct/20 19:13

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 40m