SPARK-31430: Bug in the approximate quantile computation.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      I am seeing a bug where passing a lower relative error to the approxQuantile function leads to incorrect results when the DataFrame has multiple partitions. Setting a relative error of 1e-6 causes it to compute equal values for the 0.9 and 1.0 quantiles. Coalescing back to a single partition gives the correct results. This issue was not present in Spark 2.4.5; we noticed it when testing 3.0.0-preview.

      >>> from pyspark.sql import types as T  # needed for the schema below
      >>> df = spark.read.csv('file:///tmp/approx_quantile_data.csv', header=True, schema=T.StructType([T.StructField('Store',T.StringType(),True),T.StructField('seconds',T.LongType(),True)]))
      >>> df = df.repartition(200, 'Store').localCheckpoint()
      >>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.0001)
      [1422576000.0, 1430352000.0, 1438300800.0]
      >>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.00001)
      [1422576000.0, 1430524800.0, 1438300800.0]
      >>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.000001)
      [1422576000.0, 1438300800.0, 1438300800.0]
      >>> df.coalesce(1).approxQuantile('seconds', [0.8, 0.9, 1.0], 0.000001)
      [1422576000.0, 1430524800.0, 1438300800.0]

      Attachments

        1. approx_quantile_data.csv (12.08 MB, uploaded by Siddartha Naidu)


    People

      Assignee: Unassigned
      Reporter: Siddartha Naidu
      Votes: 0
      Watchers: 4
