Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Duplicate
Description
I assume the issue is on the C++ side of the datasets code, but the reproducer is in Python.
I create a small Parquet file with a single column of dictionary type. Reading it with pq.read_table works fine, but reading it with the datasets machinery hangs when scanning:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'a': pd.Categorical(['a', 'b'] * 10)})
arrow_table = pa.Table.from_pandas(df)
filename = "test.parquet"
pq.write_table(arrow_table, filename)

from pyarrow.fs import LocalFileSystem
from pyarrow.dataset import (
    ParquetFileFormat, Dataset, FileSystemDataSourceDiscovery,
    FileSystemDiscoveryOptions)

filesystem = LocalFileSystem()
format = ParquetFileFormat()
options = FileSystemDiscoveryOptions()
discovery = FileSystemDataSourceDiscovery(
    filesystem, [filename], format, options)
inspected_schema = discovery.inspect()
dataset = Dataset([discovery.finish()], inspected_schema)

# dataset.schema works fine and gives the correct schema
dataset.schema

scanner_builder = dataset.new_scan()
scanner = scanner_builder.finish()
# this hangs
scanner.to_table()
Issue Links
- duplicates ARROW-6895 [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader repeats returned values when calling `NextBatch()` (Resolved)