Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Duplicate
Description
I assume the issue is on the C++ side of the datasets code, but the reproducer is in Python.
I create a small Parquet file with a single column of dictionary type. Reading it with pq.read_table works fine, but reading it with the datasets machinery hangs when scanning:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'a': pd.Categorical(['a', 'b'] * 10)})
arrow_table = pa.Table.from_pandas(df)
filename = "test.parquet"
pq.write_table(arrow_table, filename)

from pyarrow.fs import LocalFileSystem
from pyarrow.dataset import (
    ParquetFileFormat, Dataset, FileSystemDataSourceDiscovery,
    FileSystemDiscoveryOptions)

filesystem = LocalFileSystem()
format = ParquetFileFormat()
options = FileSystemDiscoveryOptions()
discovery = FileSystemDataSourceDiscovery(
    filesystem, [filename], format, options)
inspected_schema = discovery.inspect()
dataset = Dataset([discovery.finish()], inspected_schema)

# dataset.schema works fine and gives the correct schema
dataset.schema

scanner_builder = dataset.new_scan()
scanner = scanner_builder.finish()
# this hangs
scanner.to_table()
Issue Links
- duplicates ARROW-6895 [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader repeats returned values when calling `NextBatch()` (Resolved)