[ARROW-13573] [C++] Support dictionaries directly in case_when kernel - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 6.0.0
Component/s: C++
Labels:

External issue URL:
https://github.com/apache/arrow/issues/29220

Description

case_when (and other similar kernels) currently dictionary-decode inputs, then operate on the decoded values. This is both inefficient and unexpected. We should instead operate directly on dictionary indices.
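To illustrate what operating on indices means, here is a minimal sketch for the simplest case, where both inputs already share one dictionary. It does not use Arrow's API: `DictArray` and `CaseWhenSameDict` are hypothetical names introduced only for this example, and the two-branch shape is a simplification of case_when.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a dictionary-encoded array (not an Arrow type).
struct DictArray {
  std::vector<std::string> dictionary;          // distinct values
  std::vector<std::optional<int32_t>> indices;  // std::nullopt marks a null slot
};

// Two-branch selection over inputs assumed to share the same dictionary.
// Only indices are selected row by row; no value is decoded or copied per row.
DictArray CaseWhenSameDict(const std::vector<bool>& cond,
                           const DictArray& when_true,
                           const DictArray& otherwise) {
  DictArray out;
  out.dictionary = when_true.dictionary;  // assumption: equals otherwise.dictionary
  out.indices.reserve(cond.size());
  for (std::size_t i = 0; i < cond.size(); ++i) {
    out.indices.push_back(cond[i] ? when_true.indices[i] : otherwise.indices[i]);
  }
  return out;
}
```

The per-row work is a single integer copy, regardless of how large the dictionary values are. The cases where the dictionaries differ are the edge cases discussed next.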

Of course, this introduces more edge cases. If the dictionaries of inputs do not match, we have the following choices:

1. Raise an error.
2. Unify the dictionaries.
3. Use one of the dictionaries, and raise an error if an index of another dictionary cannot be mapped to an index of the chosen dictionary.
4. Use one of the dictionaries, and emit null if an index of another dictionary cannot be mapped to an index of the chosen dictionary. (This is what base dplyr if_else does with factors.)

All of these options are reasonable, so we should introduce an options struct. We can implement #3 and #4 at first (to cover R). #2 isn't strictly necessary, as the user can unify the dictionaries manually first, though unifying inside the kernel may be more efficient. Similarly, #1 isn't strictly necessary.

#3 and #4 are justifiable (beyond just "it's what R does") since users may filter down disjoint dictionaries into a set of common values and then expect to combine the remaining values with a kernel like case_when.
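A rough sketch of how #3 and #4 could behave, written against plain standard-library containers rather than Arrow's API. The names `UnmappedBehavior` and `RemapIndices` are hypothetical and chosen only for this illustration; the actual options struct may look different.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical option: what to do when a value is absent from the chosen dictionary.
enum class UnmappedBehavior { kError, kNull };  // roughly #3 and #4

// Rewrites indices into other_dict as indices into chosen_dict.
// std::nullopt represents a null slot.
std::vector<std::optional<int32_t>> RemapIndices(
    const std::vector<std::string>& chosen_dict,
    const std::vector<std::string>& other_dict,
    const std::vector<std::optional<int32_t>>& other_indices,
    UnmappedBehavior behavior) {
  // Value -> position lookup for the chosen dictionary.
  std::unordered_map<std::string, int32_t> position;
  for (std::size_t i = 0; i < chosen_dict.size(); ++i) {
    position.emplace(chosen_dict[i], static_cast<int32_t>(i));
  }

  std::vector<std::optional<int32_t>> out;
  out.reserve(other_indices.size());
  for (const auto& idx : other_indices) {
    if (!idx) {  // input slot is already null
      out.push_back(std::nullopt);
      continue;
    }
    const std::string& value = other_dict[static_cast<std::size_t>(*idx)];
    auto it = position.find(value);
    if (it != position.end()) {
      out.push_back(it->second);
    } else if (behavior == UnmappedBehavior::kNull) {
      out.push_back(std::nullopt);  // #4: emit null
    } else {
      throw std::invalid_argument("value not in chosen dictionary: " + value);  // #3
    }
  }
  return out;
}
```

Only values that are actually referenced by an index are looked up, so dictionary entries that were filtered out of the data never trigger the error or null path. That matches the filtering scenario described above.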

As described on GitHub.

Issue Links

is related to

ARROW-14042 [C++] Improve performance on dictionaries for 'case_when' kernel

Open

ARROW-14105 [C++] Reconcile type promotion rules between if_else, case_when, coalesce, select

Open

ARROW-14177 [C++] Optimize dictionary support in kernels/Support nulls in DictionaryUnifier

Open

relates to

ARROW-13222 [C++] Support variable-width types in case_when function

Resolved

links to

GitHub Pull Request #11022

People

Assignee: David Li

Reporter: David Li

Votes: 0

Watchers: 3

Dates

Created: 05/Aug/21 14:29

Updated: 11/Jan/23 08:34

Resolved: 21/Sep/21 11:08

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 4h 20m