Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-24013

ApproximatePercentile grinds to a halt on sorted input.

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: SQL
    • Labels:
      None

      Description

      Running

      sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid from range(10000000))").collect()
      

      takes 7 seconds, while

      sql("select approx_percentile(id, array(0.1)) from range(10000000)").collect()
      

      grinds to a halt - processes the first million rows quickly, and then slows down to a few thousands rows / second (4m rows processed after 20 minutes).

      Thread dumps show that it spends time in QuantileSummary.compress.
      Seems it hits some edge case inefficiency when dealing with sorted data?

        Attachments

          Activity

            People

            • Assignee:
              mgaido Marco Gaido
              Reporter:
              juliuszsompolski Juliusz Sompolski

              Dates

              • Created:
                Updated:
                Resolved:

                Issue deployment