Apache Arrow / ARROW-8647

[C++][Dataset] Optionally encode partition field values as dictionary type


Details

    Description

      In the Python ParquetDataset implementation, the partition fields are returned as dictionary type columns.

      In the new Dataset API, partition fields currently get a plain type (integer or string when the type is inferred). You can already request dictionary type for the partition keys manually by specifying the partitioning schema (in the Partitioning passed to the dataset factory).

      Because partition keys are typically repeated values in the resulting table, dictionary type can be more efficient, so it might be good to still have an option in the DatasetFactory to use dictionary types for the partition fields.
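
      To illustrate why repeated partition keys suit dictionary encoding, here is a minimal, library-free sketch. It is not Arrow's implementation: the helper names `dictionary_encode` and `dictionary_decode` and the `year` values are hypothetical, chosen only to show the idea of storing each distinct key once plus a small integer index per row.

```python
def dictionary_encode(values):
    """Split a column into (dictionary, indices).

    dictionary: the distinct values, in first-seen order.
    indices: for each row, the position of its value in the dictionary.
    """
    dictionary = []
    positions = {}  # value -> index into dictionary
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the plain column from its dictionary-encoded form."""
    return [dictionary[i] for i in indices]


# Hypothetical partition column: three rows read from a year=2019
# directory followed by two rows read from a year=2020 directory.
plain_column = ["2019"] * 3 + ["2020"] * 2

dictionary, indices = dictionary_encode(plain_column)
print(dictionary)  # ['2019', '2020']
print(indices)     # [0, 0, 0, 1, 1]
```

      Each partition key is stored once in the dictionary, so a column with many rows per partition carries only small integer indices per row instead of a full copy of the key.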

      See also https://github.com/apache/arrow/pull/6303#discussion_r400622340


            People

              bkietz Ben Kietzman
              jorisvandenbossche Joris Van den Bossche
              Votes: 0
              Watchers: 4


                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2h 10m