Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-16630

[C++] Proper BottomK support in SinkNode

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • C++

    Description

      BottomK is implemented as TopK on reverse-sorted data. You get the rows you wanted, but the problem is that they're in reversed order.

      Other consideration: we've been using TopK as a (theoretically) efficient way to do `arrange() %>% head()`, and so BottomK would do `arrange() %>% tail()`. But this intersects with differences in null-handling in sorting (ARROW-12063).

      Example in R:

      > df <- data.frame(x = c(2, 4, 1, NA, 3))
      # as ian says on ARROW-14085, dplyr puts NAs last in sorting, always
      > df %>% arrange(x)
         x
      1  1
      2  2
      3  3
      4  4
      5 NA
      > df %>% arrange(desc(x))
         x
      1  4
      2  3
      3  2
      4  1
      5 NA
      
      # So when you arrange %>% head/tail, you get values based on that:
      > df %>% arrange(x) %>% head(1)
        x
      1 1
      > df %>% arrange(x) %>% tail(1)
         x
      5 NA
      
      # We sort like that in arrow too:
      > df %>% arrow_table() %>% arrange(x) %>% collect()
         x
      1  1
      2  2
      3  3
      4  4
      5 NA
      > df %>% arrow_table() %>% arrange(desc(x)) %>% collect()
         x
      1  4
      2  3
      3  2
      4  1
      5 NA
      
      # But since we implement arrange(x) %>% head as TopK(x) and arrange(x) %>% tail as TopK(desc(x)),
      # we don't get the same tail value:
      > df %>% arrow_table() %>% arrange(x) %>% head(1) %>% collect()
        x
      1 1
      > df %>% arrow_table() %>% arrange(x) %>% tail(1) %>% collect()
        x
      1 4
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            npr Neal Richardson
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: