ARROW-8087

[C++][Dataset] Order of keys with HivePartitioning is lost in resulting schema

Details

    Description

      Currently, when reading a partitioned dataset with hive partitioning, the partition columns appear to be sorted alphabetically when they are appended to the schema, whereas the old ParquetDataset implementation keeps the order in which the keys appear in the paths.
      For a regular partitioning this order is consistent for all fragments.

      For example, for the typical NYC Taxi data, the schema produced by the datasets API ends with the columns "month, year", while ParquetDataset appends them as "year, month".

      Python example:

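      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import pyarrow.dataset as ds
      import pyarrow.parquet as pq

      foo_keys = [0, 1]
      bar_keys = ['a', 'b', 'c']
      N = 30

      df = pd.DataFrame({
          'foo': np.array(foo_keys, dtype='i4').repeat(15),
          'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
          'values': np.random.randn(N)
      })

      # 'foo' is the outer partition level, 'bar' the inner one
      pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])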
      
      >>> pq.read_table("test_order").schema
      values: double
      foo: dictionary<values=int64, indices=int32, ordered=0>
      bar: dictionary<values=string, indices=int32, ordered=0>
      
      >>> ds.dataset("test_order", format="parquet", partitioning="hive").schema
      values: double
      bar: string
      foo: int32
      

      So the order is "foo, bar" vs "bar, foo" (the fact that the columns are dictionaries in the first case is a separate matter).
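To make the expected ordering concrete, here is a minimal pure-Python sketch of recovering the key order from a hive-style fragment path and moving the partition columns back into that order. The helper names `hive_keys_in_path_order` and `reorder_columns` and the fragment file name `part-0.parquet` are hypothetical and chosen for illustration only; this is not how Arrow implements partition discovery.

```python
from pathlib import PurePosixPath


def hive_keys_in_path_order(path):
    """Return the partition keys in the order they appear in a hive-style path."""
    keys = []
    for segment in PurePosixPath(path).parts:
        key, sep, _value = segment.partition("=")
        # Only "key=value" segments are partition levels; skip everything else.
        if sep and key and key not in keys:
            keys.append(key)
    return keys


def reorder_columns(names, partition_keys):
    """Keep data columns first, then append partition keys in the given order."""
    data_columns = [name for name in names if name not in partition_keys]
    return data_columns + [key for key in partition_keys if key in names]


# Hypothetical fragment path, following the foo=.../bar=... layout from the example.
path = "test_order/foo=0/bar=a/part-0.parquet"
path_order = hive_keys_in_path_order(path)

print(path_order)          # order as present in the path
print(sorted(path_order))  # alphabetical order, as seen in the datasets schema
print(reorder_columns(["values", "bar", "foo"], path_order))
```

As a possible workaround until the ordering is fixed, the reordered name list could be used to select columns from the table that was read, assuming the installed pyarrow version offers a way to select columns by name.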


            People

              bkietz Ben Kietzman
              jorisvandenbossche Joris Van den Bossche


                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 50m
