Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-11439

PERCENTILE_APPROX is inconsistent with PERCENTILE even for a small number of data points

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: In Progress
    • Minor
    • Resolution: Unresolved
    • None
    • None
    • None
    • None
    • Hide
      The PERCENTILE aggregation function uses simple interpolation between values at positions $\floor{p * (n - 1)}$ and $\ceil{p * (n - 1)}$, where n is the total number of values:
      https://github.com/apache/hive/blob/ac755ebe26361a4647d53db2a28500f71697b276/ql/src/java/org/apache/hadoop/hive/ql/udf/UDAFPercentile.java#L95
      (position calculation: https://github.com/apache/hive/blob/ac755ebe26361a4647d53db2a28500f71697b276/ql/src/java/org/apache/hadoop/hive/ql/udf/UDAFPercentile.java#L218) By the way, the javadoc references the NIST method (https://en.wikipedia.org/wiki/Percentile#NIST_method) but the implementation seems to be a modification of the NIST method, multiplying the requested quantile value by (n - 1) instead of (n + 1) as the NIST method says.

      The above implementation is inconsistent with what PERCENTILE_APPROX does (https://github.com/apache/hive/blob/20034d3ef223c249aa72ffa315238af6c7157afd/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java#L264), leading to a discrepancy between PERCENTILE and PERCENTILE_APPROX even for a small number of values. PERCENTILE_APPROX is supposed to be precise for a small number of values (less than the number of buckets in the histogram).

      The proposal of this JIRA issue is to fix PERCENTILE_APPROX's quantile interpolation in NumericHistogram to be consistent with the modified NIST method used by the PERCENTILE function.
      Show
      The PERCENTILE aggregation function uses simple interpolation between values at positions $\floor{p * (n - 1)}$ and $\ceil{p * (n - 1)}$, where n is the total number of values: https://github.com/apache/hive/blob/ac755ebe26361a4647d53db2a28500f71697b276/ql/src/java/org/apache/hadoop/hive/ql/udf/UDAFPercentile.java#L95 (position calculation: https://github.com/apache/hive/blob/ac755ebe26361a4647d53db2a28500f71697b276/ql/src/java/org/apache/hadoop/hive/ql/udf/UDAFPercentile.java#L218) By the way, the javadoc references the NIST method ( https://en.wikipedia.org/wiki/Percentile#NIST_method ) but the implementation seems to be a modification of the NIST method, multiplying the requested quantile value by (n - 1) instead of (n + 1) as the NIST method says. The above implementation is inconsistent with what PERCENTILE_APPROX does ( https://github.com/apache/hive/blob/20034d3ef223c249aa72ffa315238af6c7157afd/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java#L264), leading to a discrepancy between PERCENTILE and PERCENTILE_APPROX even for a small number of values. PERCENTILE_APPROX is supposed to be precise for a small number of values (less than the number of buckets in the histogram). The proposal of this JIRA issue is to fix PERCENTILE_APPROX's quantile interpolation in NumericHistogram to be consistent with the modified NIST method used by the PERCENTILE function.

    Attachments

      Activity

        People

          mikhail Mikhail Gryzykhin
          mikhail Mikhail Gryzykhin
          Votes:
          0 Vote for this issue
          Watchers:
          3 Start watching this issue

          Dates

            Created:
            Updated: