ARROW-10099

[C++][Dataset] Also allow integer partition fields to be dictionary encoded

Details

    Description

      In ARROW-8647, we added the option to indicate that your partition field columns should be dictionary encoded, but it currently only does this for string type, and not for integer type (with the reasoning that for integers, dictionary encoding does not give any memory efficiency gains).

      In dask, they have been using categorical dtypes for all partition fields, even if they are integers. They would like to keep doing this (apart from memory efficiency, using a categorical/dictionary type also gives information about all unique values of the column, without having to calculate this), so it would be nice to enable this use case.

      So I think we could either simply always dictionary encode integers as well when max_partition_dictionary_size indicates partition fields should be dictionary encoded, or have an additional option to indicate that integer partition fields should also be encoded (if the other option indicates dictionary encoding should be used).
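To make the two options concrete, here is a minimal pure-Python sketch of the behaviour being discussed. It is not Arrow's implementation and does not use the Arrow API: `partition_column`, `DictionaryColumn`, `dictionary_encode` and `encode_integers` are hypothetical names chosen for illustration. `dictionary_encode` stands in for "max_partition_dictionary_size indicates dictionary encoding", and `encode_integers` stands in for the proposed additional option. Setting `encode_integers=True` unconditionally corresponds to the first proposal.

```python
from dataclasses import dataclass
from typing import List, Union

Value = Union[int, str]


@dataclass
class DictionaryColumn:
    """Dictionary-encoded column: unique values plus one index per row."""
    dictionary: List[Value]
    indices: List[int]


def partition_column(paths, field, dictionary_encode, encode_integers):
    """Build the column for one hive-style (key=value) partition field.

    Both flags are hypothetical stand-ins for the options discussed above.
    """
    # Collect the raw string value of `field` from every path.
    raw_values = []
    for path in paths:
        for segment in path.split("/"):
            key, sep, raw = segment.partition("=")
            if sep and key == field:
                raw_values.append(raw)

    # Infer integer type only if every value parses as an integer.
    try:
        values = [int(raw) for raw in raw_values]
        is_integer = True
    except ValueError:
        values = list(raw_values)
        is_integer = False

    # Current behaviour: integers stay plain unless explicitly requested.
    if not dictionary_encode or (is_integer and not encode_integers):
        return values

    # Dictionary encode: unique values in first-seen order, rows as indices.
    dictionary, positions, indices = [], {}, []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return DictionaryColumn(dictionary, indices)


paths = [
    "year=2019/month=1/a.parquet",
    "year=2019/month=2/b.parquet",
    "year=2020/month=1/c.parquet",
]
encoded = partition_column(paths, "year", dictionary_encode=True, encode_integers=True)
plain = partition_column(paths, "year", dictionary_encode=True, encode_integers=False)
```

In this sketch, `encoded.dictionary` already holds the unique partition values, which is the information dask wants to get without a separate pass over the data, while `plain` is an ordinary list of integers, matching the current behaviour for integer fields.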

      Based on feedback from the dask PR using the dataset API at https://github.com/dask/dask/pull/6534#issuecomment-698723009

      cc rjzamora bkietz

            People

              bkietz Ben Kietzman
              jorisvandenbossche Joris Van den Bossche

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 10m