  1. Apache Arrow
  2. ARROW-13573

[C++] Support dictionaries directly in case_when kernel


Details

    Description

      case_when (and other similar kernels) currently dictionary-decode inputs, then operate on the decoded values. This is both inefficient and unexpected. We should instead operate directly on dictionary indices.
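A minimal sketch of what "operating directly on dictionary indices" could look like when every input shares one dictionary. It is plain standard C++, not the Arrow kernel: `CaseWhenOnIndices` and the `Indices` alias are hypothetical names, conditions are plain booleans (no null conditions), and a null index is modeled as `std::nullopt`.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// One dictionary index per row; std::nullopt models a null slot.
using Indices = std::vector<std::optional<std::int32_t>>;

// Hypothetical helper: for each row, copy the index from the first case whose
// condition is true. If there is one more case than conditions, the last case
// acts as the "else" branch; otherwise unmatched rows stay null.
// The dictionary values themselves are never touched.
Indices CaseWhenOnIndices(const std::vector<std::vector<bool>>& conditions,
                          const std::vector<Indices>& cases,
                          std::size_t length) {
  Indices out(length, std::nullopt);
  for (std::size_t row = 0; row < length; ++row) {
    bool matched = false;
    for (std::size_t k = 0; k < conditions.size() && !matched; ++k) {
      if (conditions[k][row]) {
        out[row] = cases[k][row];
        matched = true;
      }
    }
    if (!matched && cases.size() > conditions.size()) {
      out[row] = cases[conditions.size()][row];
    }
  }
  return out;
}
```

The output indices refer to the same shared dictionary as the inputs, so no decode or re-encode step is needed in this case.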

      Of course, this introduces more edge cases. If the dictionaries of inputs do not match, we have the following choices:

      1. Raise an error.
      2. Unify the dictionaries.
      3. Use one of the dictionaries, and raise an error if an index of another dictionary cannot be mapped to an index of the chosen dictionary.
      4. Use one of the dictionaries, and emit null if an index of another dictionary cannot be mapped to an index of the chosen dictionary. (This is what base dplyr if_else does with factors.)
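Options 3 and 4 differ only in what happens when an index has no counterpart in the chosen dictionary. The sketch below illustrates that difference in standard C++ with string dictionaries. It is not Arrow code: `UnmappedIndexBehavior`, `BuildTransposeMap`, and `RemapIndices` are hypothetical names chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical policy switch: kError corresponds to option 3, kEmitNull to option 4.
enum class UnmappedIndexBehavior { kError, kEmitNull };

// For each entry of other_dict, find the position of the same value in
// chosen_dict, or std::nullopt if the value is absent there.
std::vector<std::optional<std::int32_t>> BuildTransposeMap(
    const std::vector<std::string>& chosen_dict,
    const std::vector<std::string>& other_dict) {
  std::unordered_map<std::string, std::int32_t> position;
  for (std::size_t i = 0; i < chosen_dict.size(); ++i) {
    position.emplace(chosen_dict[i], static_cast<std::int32_t>(i));
  }
  std::vector<std::optional<std::int32_t>> transpose;
  transpose.reserve(other_dict.size());
  for (const auto& value : other_dict) {
    auto it = position.find(value);
    if (it == position.end()) {
      transpose.push_back(std::nullopt);
    } else {
      transpose.push_back(it->second);
    }
  }
  return transpose;
}

// Rewrite indices into the chosen dictionary's index space. Null inputs stay
// null. An index that is actually used but has no mapping either throws
// (kError) or becomes null (kEmitNull).
std::vector<std::optional<std::int32_t>> RemapIndices(
    const std::vector<std::optional<std::int32_t>>& indices,
    const std::vector<std::optional<std::int32_t>>& transpose,
    UnmappedIndexBehavior behavior) {
  std::vector<std::optional<std::int32_t>> out;
  out.reserve(indices.size());
  for (const auto& index : indices) {
    if (!index.has_value()) {
      out.push_back(std::nullopt);
      continue;
    }
    const auto& mapped = transpose.at(static_cast<std::size_t>(*index));
    if (!mapped.has_value() && behavior == UnmappedIndexBehavior::kError) {
      throw std::invalid_argument("index " + std::to_string(*index) +
                                  " has no counterpart in the chosen dictionary");
    }
    out.push_back(mapped);
  }
  return out;
}
```

In this sketch an unmapped dictionary entry only matters if some row actually references it, so inputs whose dictionaries differ but whose used values overlap pass through under either policy.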

      All of these options are reasonable, so we should introduce an options struct. We can implement #3 and #4 at first (to cover R); #2 isn't strictly necessary, as the user can unify the dictionaries manually first, but doing the unification inside the kernel may be more efficient. Similarly, #1 isn't strictly necessary.
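For comparison, option 2 (unifying the dictionaries) can be sketched as follows. This is a standalone illustration in standard C++ with string dictionaries, not the kernel's implementation; `UnifiedDictionaries` and `UnifyDictionaries` are hypothetical names.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Result of unification: one merged dictionary plus, for each input
// dictionary, a table mapping its old indices to indices in the merged one.
struct UnifiedDictionaries {
  std::vector<std::string> dictionary;
  std::vector<std::vector<std::int32_t>> transpose_maps;
};

// Hypothetical helper: merge dictionaries in order of first appearance.
UnifiedDictionaries UnifyDictionaries(
    const std::vector<std::vector<std::string>>& dictionaries) {
  UnifiedDictionaries result;
  std::unordered_map<std::string, std::int32_t> position;
  for (const auto& dict : dictionaries) {
    std::vector<std::int32_t> transpose;
    transpose.reserve(dict.size());
    for (const auto& value : dict) {
      // emplace leaves the map unchanged if the value was already seen.
      auto [it, inserted] = position.emplace(
          value, static_cast<std::int32_t>(result.dictionary.size()));
      if (inserted) {
        result.dictionary.push_back(value);
      }
      transpose.push_back(it->second);
    }
    result.transpose_maps.push_back(std::move(transpose));
  }
  return result;
}
```

Unlike options 3 and 4, every index has a mapping here, so there is no failure or null case to handle; the cost is building a new dictionary and rewriting every input's indices.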

      #3 and #4 are justifiable (beyond just "it's what R does") since users may filter down disjoint dictionaries into a set of common values and then expect to combine the remaining values with a kernel like case_when.

      As described on GitHub.


          People

            lidavidm David Li


              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 4h 20m