Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-24013

ApproximatePercentile grinds to a halt on sorted input.

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: SQL
    • Labels:
      None

      Description

      Running

      sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid from range(10000000))").collect()
      

      takes 7 seconds, while

      sql("select approx_percentile(id, array(0.1)) from range(10000000)").collect()
      

      grinds to a halt - processes the first million rows quickly, and then slows down to a few thousands rows / second (4m rows processed after 20 minutes).

      Thread dumps show that it spends time in QuantileSummary.compress.
      Seems it hits some edge case inefficiency when dealing with sorted data?

        Attachments

        1. screenshot-1.png
          191 kB
          Juliusz Sompolski

          Activity

            People

            • Assignee:
              mgaido Marco Gaido
              Reporter:
              juliuszsompolski Juliusz Sompolski
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: