[ARROW-16630] [C++] Proper BottomK support in SinkNode - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: C++
Labels:
- query-engine

External issue URL:
https://github.com/apache/arrow/issues/20255

Description

BottomK is implemented as TopK on reverse-sorted data. You get the rows you wanted, but the problem is that they're in reversed order.

Other consideration: we've been using TopK as a (theoretically) efficient way to do `arrange() %>% head()`, and so BottomK would do `arrange() %>% tail()`. But this intersects with differences in null-handling in sorting (~~ARROW-12063~~).

Example in R:

> df <- data.frame(x = c(2, 4, 1, NA, 3))
# as ian says on ARROW-14085, dplyr puts NAs last in sorting, always
> df %>% arrange(x)
   x
1  1
2  2
3  3
4  4
5 NA
> df %>% arrange(desc(x))
   x
1  4
2  3
3  2
4  1
5 NA

# So when you arrange %>% head/tail, you get values based on that:
> df %>% arrange(x) %>% head(1)
  x
1 1
> df %>% arrange(x) %>% tail(1)
   x
5 NA

# We sort like that in arrow too:
> df %>% arrow_table() %>% arrange(x) %>% collect()
   x
1  1
2  2
3  3
4  4
5 NA
> df %>% arrow_table() %>% arrange(desc(x)) %>% collect()
   x
1  4
2  3
3  2
4  1
5 NA

# But since we implement arrange(x) %>% head as TopK(x) and arrange(x) %>% tail as TopK(desc(x)),
# we don't get the same tail value:
> df %>% arrow_table() %>% arrange(x) %>% head(1) %>% collect()
  x
1 1
> df %>% arrow_table() %>% arrange(x) %>% tail(1) %>% collect()
  x
1 4

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Neal Richardson

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 23/May/22 14:56

Updated:: 11/Jan/23 11:45