Spark / SPARK-24013

ApproximatePercentile grinds to a halt on sorted input.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: SQL
    • Labels: None

    Description

      Running

      sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid from range(10000000))").collect()
      

      takes 7 seconds, while

      sql("select approx_percentile(id, array(0.1)) from range(10000000)").collect()
      

      grinds to a halt: it processes the first million rows quickly, then slows down to a few thousand rows per second (only 4 million rows processed after 20 minutes).

      Thread dumps show that the time is spent in QuantileSummaries.compress.
      This looks like an edge-case inefficiency triggered by sorted input.
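
      To compare the two cases side by side, a small timing harness along these lines may help. This is only a sketch: the names `timed` and `compare` are hypothetical helpers, and it assumes the queries are run from a PySpark session where `spark` is the active SparkSession. On an affected version (2.3.0) the sorted query may run for a very long time, so it might be worth lowering `N_ROWS` first.

```python
import time

# Row count used in the report; lower it when experimenting on an affected version.
N_ROWS = 10_000_000

# The two queries from the description: random input vs. already-sorted input.
RANDOM_QUERY = (
    "select approx_percentile(rid, array(0.1)) "
    f"from (select rand() as rid from range({N_ROWS}))"
)
SORTED_QUERY = f"select approx_percentile(id, array(0.1)) from range({N_ROWS})"


def timed(run_query, query):
    """Run run_query(query) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = run_query(query)
    return result, time.perf_counter() - start


def compare(run_query):
    """Time both queries with the given runner; return {label: elapsed_seconds}."""
    timings = {}
    for label, query in (("random", RANDOM_QUERY), ("sorted", SORTED_QUERY)):
        _, elapsed = timed(run_query, query)
        timings[label] = elapsed
        print(f"{label:>6}: {elapsed:.1f} s")
    return timings
```

      In a PySpark shell this could be invoked as `compare(lambda q: spark.sql(q).collect())`. Passing the runner in as a function keeps the harness independent of how the session is created.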

      Attachments

        1. screenshot-1.png
          191 kB
          Juliusz Sompolski

          People

            Assignee: mgaido Marco Gaido
            Reporter: juliuszsompolski Juliusz Sompolski