Details
- Type: Improvement
- Priority: Major
- Status: Resolved
- Resolution: Fixed
Description
Currently, when reading a partitioned dataset with hive partitioning, the partition columns appear to get sorted alphabetically when they are appended to the schema (while the old ParquetDataset implementation keeps the order in which they appear in the file paths).
For a regular partitioning, this order is consistent across all fragments.
So, for example, for the typical NYC Taxi data, the new datasets API produces a schema ending with the columns "month, year", while ParquetDataset appends them as "year, month".
Python example:
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

foo_keys = [0, 1]
bar_keys = ['a', 'b', 'c']
N = 30
df = pd.DataFrame({
    'foo': np.array(foo_keys, dtype='i4').repeat(15),
    'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
    'values': np.random.randn(N)
})
# Write a dataset partitioned on 'foo' first, then 'bar' (paths like foo=0/bar=a/...)
pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])
>>> pq.read_table("test_order").schema
values: double
foo: dictionary<values=int64, indices=int32, ordered=0>
bar: dictionary<values=string, indices=int32, ordered=0>

>>> ds.dataset("test_order", format="parquet", partitioning="hive").schema
values: double
bar: string
foo: int32
so "foo, bar" vs "bar, foo" (the fact that it are dictionaries is something else)