Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
Description
BottomK is implemented as TopK on reverse-sorted data. You get the rows you wanted, but the problem is that they're in reversed order.
Other consideration: we've been using TopK as a (theoretically) efficient way to do `arrange() %>% head()`, and so BottomK would do `arrange() %>% tail()`. But this intersects with differences in null-handling in sorting (ARROW-12063).
Example in R:
> df <- data.frame(x = c(2, 4, 1, NA, 3)) # as ian says on ARROW-14085, dplyr puts NAs last in sorting, always > df %>% arrange(x) x 1 1 2 2 3 3 4 4 5 NA > df %>% arrange(desc(x)) x 1 4 2 3 3 2 4 1 5 NA # So when you arrange %>% head/tail, you get values based on that: > df %>% arrange(x) %>% head(1) x 1 1 > df %>% arrange(x) %>% tail(1) x 5 NA # We sort like that in arrow too: > df %>% arrow_table() %>% arrange(x) %>% collect() x 1 1 2 2 3 3 4 4 5 NA > df %>% arrow_table() %>% arrange(desc(x)) %>% collect() x 1 4 2 3 3 2 4 1 5 NA # But since we implement arrange(x) %>% head as TopK(x) and arrange(x) %>% tail as TopK(desc(x)), # we don't get the same tail value: > df %>% arrow_table() %>% arrange(x) %>% head(1) %>% collect() x 1 1 > df %>% arrow_table() %>% arrange(x) %>% tail(1) %>% collect() x 1 4