Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
While testing the new feature from ARROW-8647, I ran into a segfault. A Python test that reproduces it:
@pytest.mark.parquet
@pytest.mark.parametrize('partitioning', ["directory", "hive"])
def test_open_dataset_partitioned_dictionary_type(tempdir, partitioning):
    import pyarrow.parquet as pq
    table = pa.table({'a': range(9), 'b': [0.] * 4 + [1.] * 5})

    path = tempdir / "dataset"
    path.mkdir()

    for part in ["A", "B", "C"]:
        fmt = "{}" if partitioning == "directory" else "part={}"
        part = path / fmt.format(part)
        part.mkdir()
        pq.write_table(table, part / "test.parquet")

    if partitioning == "directory":
        part = ds.DirectoryPartitioning.discover(
            ["part"], max_partition_dictionary_size=-1)
    else:
        part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)

    dataset = ds.dataset(str(path), partitioning=part)

    expected_schema = table.schema.append(
        pa.field("part", pa.dictionary(pa.int32(), pa.string()))
    )
    assert dataset.schema.equals(expected_schema)
This test fails (segfaults) for HivePartitioning, but passes for DirectoryPartitioning.
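For readers unfamiliar with the two layouts, here is a rough pure-Python sketch of how the partition field could be recovered from each kind of path and then dictionary-encoded. It is only an illustration of the idea and not how Arrow implements discovery; the helper names (`parse_directory_path`, `parse_hive_path`, `dictionary_encode`) are made up for this example.

```python
from pathlib import PurePosixPath


def parse_directory_path(relative_path, field_names):
    """Directory-style: segments are bare values, matched to names by position."""
    segments = PurePosixPath(relative_path).parent.parts
    return dict(zip(field_names, segments))


def parse_hive_path(relative_path):
    """Hive-style: each segment carries its own field name as key=value."""
    fields = {}
    for segment in PurePosixPath(relative_path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep:  # skip segments that are not of the form key=value
            fields[key] = value
    return fields


def dictionary_encode(values):
    """Return (dictionary, indices): unique values in first-seen order plus codes."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


# Same file layout as the test above (hypothetical relative paths).
directory_files = ["A/test.parquet", "B/test.parquet", "C/test.parquet"]
hive_files = ["part=A/test.parquet", "part=B/test.parquet", "part=C/test.parquet"]

directory_values = [parse_directory_path(f, ["part"])["part"] for f in directory_files]
hive_values = [parse_hive_path(f)["part"] for f in hive_files]
```

Both layouts yield the same string values for `part`, so both should end up with the same dictionary-typed field in the schema. The difference is that the directory layout is given the field name up front, while the hive layout has to discover it from the paths.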
Issue Links
- is related to: ARROW-8647 [C++][Dataset] Optionally encode partition field values as dictionary type (Resolved)