[ARROW-8087] [C++][Dataset] Order of keys with HivePartitioning is lost in resulting schema - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.17.0
Component/s: C++
Labels:
- dataset
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/24297

Description

Currently, when reading a partitioned dataset with hive partitioning, it seems that the partition columns get sorted alphabetically when appending them to the schema (while the old ParquetDataset implementation keeps the order as it is present in the paths).
For a regular partitioning this order is consistent for all fragments.

So for example for the typical NYC Taxi data example, with datasets, the schema ends with columns "month, year", while the ParquetDataset appends them as "year, month".
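To make the difference concrete, here is a minimal pure-Python sketch of the two orderings. The helper `hive_keys_in_path_order` and the sample path are hypothetical illustrations, not part of the Arrow API, and this is not how the C++ discovery code is written; it only shows what "order as present in the paths" versus "sorted alphabetically" means for `key=value` directory segments.

```python
from pathlib import PurePosixPath


def hive_keys_in_path_order(path):
    """Return hive partition keys in the order they appear in `path`.

    Hypothetical helper for illustration only; not an Arrow function.
    """
    keys = []
    for segment in PurePosixPath(path).parts:
        key, sep, _value = segment.partition("=")
        # Only "key=value" segments count as partition directories.
        if sep and key:
            keys.append(key)
    return keys


path = "year=2009/month=1/data.parquet"  # assumed example fragment path
path_order = hive_keys_in_path_order(path)

print(path_order)          # ['year', 'month']  (order found in the path)
print(sorted(path_order))  # ['month', 'year']  (alphabetical order)
```

The first result matches the column order the old ParquetDataset produces; the second matches the order reported here for the datasets API.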

Python example:

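import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

foo_keys = [0, 1]
bar_keys = ['a', 'b', 'c']
N = 30

# 30 rows: every (foo, bar) combination occurs 5 times
df = pd.DataFrame({
    'foo': np.array(foo_keys, dtype='i4').repeat(15),
    'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
    'values': np.random.randn(N)
})

# Writes a hive-style layout: test_order/foo=<value>/bar=<value>/...
pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])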

>>> pq.read_table("test_order").schema
values: double
foo: dictionary<values=int64, indices=int32, ordered=0>
bar: dictionary<values=string, indices=int32, ordered=0>

>>> ds.dataset("test_order", format="parquet", partitioning="hive").schema
values: double
bar: string
foo: int32

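So the partition columns come back as "foo, bar" with ParquetDataset vs. "bar, foo" with the datasets API (the fact that the former returns them as dictionary types is a separate matter).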
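A small check along these lines could serve as a regression test for the ordering. It is a sketch only: `partition_field_order` is a hypothetical helper, and the name lists are copied from the two schemas printed above rather than read from disk (with pyarrow, `schema.names` would provide them).

```python
def partition_field_order(schema_names, partition_cols):
    """Return the partition columns in the order they occur in the schema.

    Hypothetical helper for illustration only; not an Arrow function.
    """
    wanted = set(partition_cols)
    return [name for name in schema_names if name in wanted]


# Field names copied from the two schemas shown above.
legacy_names = ["values", "foo", "bar"]    # pq.read_table(...)
dataset_names = ["values", "bar", "foo"]   # ds.dataset(...)

expected = ["foo", "bar"]  # order passed as partition_cols when writing

print(partition_field_order(legacy_names, expected) == expected)   # True
print(partition_field_order(dataset_names, expected) == expected)  # False
```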

Issue Links

links to

GitHub Pull Request #6594

People

Assignee: Ben Kietzman

Reporter: Joris Van den Bossche

Votes: 0

Watchers: 3

Dates

Created: 12/Mar/20 12:15

Updated: 11/Jan/23 07:58

Resolved: 17/Mar/20 02:49

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 50m