[HIVE-11439] PERCENTILE_APPROX is inconsistent with PERCENTILE even for a small number of data points - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: In Progress
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
None

Release Note:

Hide
The PERCENTILE aggregation function uses simple interpolation between values at positions $\floor{p * (n - 1)}$ and $\ceil{p * (n - 1)}$, where n is the total number of values:
https://github.com/apache/hive/blob/ac755ebe26361a4647d53db2a28500f71697b276/ql/src/java/org/apache/hadoop/hive/ql/udf/UDAFPercentile.java#L95
(position calculation: https://github.com/apache/hive/blob/ac755ebe26361a4647d53db2a28500f71697b276/ql/src/java/org/apache/hadoop/hive/ql/udf/UDAFPercentile.java#L218) By the way, the javadoc references the NIST method (https://en.wikipedia.org/wiki/Percentile#NIST_method) but the implementation seems to be a modification of the NIST method, multiplying the requested quantile value by (n - 1) instead of (n + 1) as the NIST method says.

The above implementation is inconsistent with what PERCENTILE_APPROX does (https://github.com/apache/hive/blob/20034d3ef223c249aa72ffa315238af6c7157afd/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java#L264), leading to a discrepancy between PERCENTILE and PERCENTILE_APPROX even for a small number of values. PERCENTILE_APPROX is supposed to be precise for a small number of values (less than the number of buckets in the histogram).

The proposal of this JIRA issue is to fix PERCENTILE_APPROX's quantile interpolation in NumericHistogram to be consistent with the modified NIST method used by the PERCENTILE function.

Show
The PERCENTILE aggregation function uses simple interpolation between values at positions $\floor{p * (n - 1)}$ and $\ceil{p * (n - 1)}$, where n is the total number of values: https://github.com/apache/hive/blob/ac755ebe26361a4647d53db2a28500f71697b276/ql/src/java/org/apache/hadoop/hive/ql/udf/UDAFPercentile.java#L95 (position calculation: https://github.com/apache/hive/blob/ac755ebe26361a4647d53db2a28500f71697b276/ql/src/java/org/apache/hadoop/hive/ql/udf/UDAFPercentile.java#L218) By the way, the javadoc references the NIST method ( https://en.wikipedia.org/wiki/Percentile#NIST_method ) but the implementation seems to be a modification of the NIST method, multiplying the requested quantile value by (n - 1) instead of (n + 1) as the NIST method says. The above implementation is inconsistent with what PERCENTILE_APPROX does ( https://github.com/apache/hive/blob/20034d3ef223c249aa72ffa315238af6c7157afd/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java#L264), leading to a discrepancy between PERCENTILE and PERCENTILE_APPROX even for a small number of values. PERCENTILE_APPROX is supposed to be precise for a small number of values (less than the number of buckets in the histogram). The proposal of this JIRA issue is to fix PERCENTILE_APPROX's quantile interpolation in NumericHistogram to be consistent with the modified NIST method used by the PERCENTILE function.

Attachments

Activity

People

Assignee:: Mikhail Gryzykhin

Reporter:: Mikhail Gryzykhin

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 03/Aug/15 09:52

Updated:: 22/Mar/21 13:08