Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-40499

Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Invalid
    • 3.2.1
    • None
    • Shuffle
    • None
    • hadoop: 3.0.0 

      spark:  2.4.0 / 3.2.1

      shuffle:spark 2.4.0

    Description

      spark.sql(
            s"""
               |SELECT
               | Info ,
               | PERCENTILE_APPROX(cost,0.5) cost_p50,
               | PERCENTILE_APPROX(cost,0.9) cost_p90,
               | PERCENTILE_APPROX(cost,0.95) cost_p95,
               | PERCENTILE_APPROX(cost,0.99) cost_p99,
               | PERCENTILE_APPROX(cost,0.999) cost_p999
               |FROM
               | textData
               |""".stripMargin)

      • When we used spark 2.4.0, aggregation adopted objHashAggregator, stage 2 pull shuffle data very quick . but , when we use spark 3.2.1 and use old shuffle , 140M shuffle data cost 3 hours. 
      • If we upgrade the Shuffle, will we get performance regression?
      •  

      Attachments

        1. Screenshot 2024-01-05 at 3.51.52 PM.png
          43 kB
          Joey Pereira
        2. Screenshot 2024-01-05 at 3.53.10 PM.png
          28 kB
          Joey Pereira
        3. spark3.2-shuffle-data.png
          39 kB
          xuanzhiang
        4. spark2.4-shuffle-data.png
          16 kB
          xuanzhiang

        Issue Links

          Activity

            People

              Unassigned Unassigned
              xuanzhiang xuanzhiang
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: