Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-37099

Introduce a rank-based filter to optimize top-k computation

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.4.0
    • 3.5.0
    • SQL
    • None

    Description

      in JD, we found that more than 90% usage of window function follows this pattern:

       select (... (row_number|rank|dense_rank) () over( [partition by ...] order by ... ) as rn)
          where rn (==|<|<=) k and other conditions

       

      However, existing physical plan is not optimum:

       

      1, we should select local top-k records within each partitions, and then compute the global top-k. this can help reduce the shuffle amount;

       

      For these three rank functions (row_number|rank|dense_rank), the rank of a key computed on partitial dataset  is always <=  its final rank computed on the whole dataset. so we can safely discard rows with partitial rank > k, anywhere.

       

       

      2, skewed-window: some partition is skewed and take a long time to finish computation.

       

      A real-world skewed-window case in our system is attached.

       

      Attachments

        1. skewed_window.png
          157 kB
          Ruifeng Zheng
        2. q67.png
          789 kB
          Ruifeng Zheng
        3. q67_optimized.png
          829 kB
          Ruifeng Zheng

        Activity

          People

            beliefer Jiaan Geng
            podongfeng Ruifeng Zheng
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: